ABSTRACT

This  dissertation  explores  the  use  of  a  preconditioned  Richardson  iterative  algorithm  for 
the  solution  of  linear  and  nonlinear  ill-posed  integral  equations  of  the  first  kind.  The 
discussion  consists  of  three  parts,  which  can  be  roughly  categorized  as:  numerical  analysis, 
applications  to  statistical  methodology,  and  an  application  to  an  inverse  problem. 

In the first part, singular matrix equations that result from discretizing ill-posed integral
equations of the first kind are considered. Sufficient conditions for the convergence
of Richardson’s algorithm to a solution are established, and necessary and sufficient conditions
are proved for special cases. The inconsistent case is also discussed. A preconditioning
for equations with positive kernels leads to the Conditional Expectation algorithm,
which is discussed in detail. A notion of ‘iterative regularization’ is introduced and related
to the more usual penalized least squares approach to regularization.

In the second part, two problems in statistical methodology are considered which involve
the solution of nonlinear integral equations of the first kind. The first is the Behrens-Fisher
problem. Trickett and Welch (Biometrika, 1954) determined a very nearly similar
test for the Behrens-Fisher problem having reasonable power by numerically ‘solving’ a
nonlinear integral equation. The Trickett-Welch method is examined, and a version of
the Conditional Expectation algorithm for nonlinear equations is applied to the Behrens-Fisher
problem. The second methodological problem considered is that of β-content
tolerance limits involving data from a one-way balanced random effects model. The Conditional
Expectation algorithm is used to approximately solve a nonlinear equation of
the first kind numerically, and thereby to derive a new tolerance limit procedure which
is shown to be a substantial improvement over the only other method in the statistics
literature.

In  the  third  part  an  inverse  problem  is  discussed  in  which  the  right  hand  side  of  the 
integral  equation  is  estimated.  In  this  example,  the  objective  is  to  infer  the  probability 
density of the radii of random spheres in a two-phase medium from radii of circles in
cross-sectional  slices  of  this  medium.  The  Conditional  Expectation  algorithm  leads  to  an 
effective  technique  for  solving  this  problem. 

Contents 


1 Ill-Posed Integral Equation Problems in Statistics
1.1 Introduction
1.2 The Ill-Posed Nature of Integral Equations of the First Kind
1.2.1 Classification of Integral Equations
1.2.2 Ill-Posed Problems
1.2.3 Near Solutions and Near Convergence of Ill-Posed Integral Equations of the First Kind
1.3 An Example: Deflection of a Simply Supported Beam
1.4 Integral Equations of the First Kind in Statistics
1.5 An Outline of the Remaining Chapters

2 A Review of Background Material from Linear Algebra, Functional Analysis, Probability, and Statistics
2.1 Matrix Algebra
2.1.1 Elementary Notions
2.1.2 The Singular Value Decomposition
2.1.3 The Jordan Canonical Form
2.1.4 Matrices Having a Diagonalizable Nullspace
2.1.5 Congruence
2.1.6 Nonnegative Matrices
2.2 Matrix Analysis
2.2.1 Matrix Norms
2.2.2 Convergent Matrices
2.2.3 Condition Numbers
2.3 Elementary Notions of Functional Analysis
2.3.1 Normed Vector Spaces
2.3.2 Hilbert Space
2.3.3 Linear Operators
2.3.4 Orthogonal Complements in Hilbert Space
2.3.5 Compact Linear Operators
2.4 Fredholm Integral Equations of the First Kind in $L_2$
2.4.1 Existence and Uniqueness of Solutions of Linear Operator Equations
2.4.2 Infinite Rank Compact Operator Equations of the First Kind are Ill-Posed
2.5 Probability Theory
2.5.1 Probability Spaces
2.5.2 Random Variables and Probability Distributions
2.5.3 Expectation and Moments
2.5.4 Conditional Probability and Independence
2.5.5 Some Distribution Theory for Statistics
2.5.6 Some Limit Theorems
2.6 A Decision-Theoretic Approach to Estimation and Hypothesis Testing
2.6.1 Decision-Making Under Uncertainty
2.6.2 Admissibility and Bayes Risk
2.6.3 Philosophies of Inference
2.6.4 Estimation
2.6.5 Hypothesis Testing
2.6.6 Confidence Intervals
2.6.7 Examples of Integral Equations in Estimation and Hypothesis Testing

3 Richardson’s Algorithm, Preconditioning, and Iterative Regularization
3.1 Richardson’s Algorithm
3.1.1 Convergence of the Richardson and Landweber Algorithms in $L_2$
3.1.2 Richardson’s Algorithm for Matrix Equations
3.2 Stochastic Preconditioning and the Conditional Expectation Algorithm
3.2.1 A Property of Positive Definite Preconditioning Matrices
3.2.2 Positive, Bounded Kernels and Stochastic Matrices
3.2.3 Some Heuristic Motivations for Stochastic Preconditioning
3.3 Richardson’s Algorithm and Iterative Regularization
3.3.1 Regularization Methods
3.3.2 Regularization Implicit in Richardson’s Algorithm
3.3.3 Positive Definite I\
3.4 Examples
3.4.1 A Fredholm Example
3.4.2 A Volterra Example

4 The Conditional Expectation Algorithm for Nonlinear Integral Equations with Peaked Kernels
4.1 A Nonlinear Equation
4.1.1 Peaked Kernels
4.2 Newton’s Method
4.2.1 The Fréchet Derivative
4.2.2 The Newton-Step Equation
4.3 The Conditional Expectation Algorithm
4.3.1 A Simple Case
4.3.2 An Example for Which $t(x) \ne x$ and $\phi(x,y) \ne y$
4.3.3 The General Case
4.4 A Simple Numerical Nonlinear Example

5 The Behrens-Fisher Problem
5.1 Historical Background
5.2 The Trickett-Welch Approach
5.2.1 The Trickett-Welch Equation
5.3 Quasi-Newton Methods and the Trickett-Welch Algorithm
5.3.1 Newton’s Method
5.3.2 Quasi-Newton Procedures
5.4 A Numerical Example
5.5 The Power of Tests for the Behrens-Fisher Problem

6 One-Sided Tolerance Limits for a One-Way Balanced Random-Effects ANOVA Model
6.1 Other Applications of Iterative Algorithms
6.2 The Tolerance Limit Problem
6.3 The One-Way Balanced Random-Effects Model
6.4 An Exact Solution for Known r
6.5 The Solution for Unknown r: A Welch-Aspin Type Asymptotic Expansion
6.5.1 A Simple, Accurate Tolerance Limit Factor Based on an Asymptotic Expansion
6.6 The Conditional Expectation Tolerance Limit Factor
6.6.1 Polynomial Approximations to the Integral Equation Solutions
6.7 The Distributions of the Tolerance Limits
6.8 Discussion
6.9 Examples

7 An Ill-Posed Inverse Problem in Stereology
7.1 Ill-Posed Inverse Problems in Applied Science
7.2 The Random Sphere Problem
7.3 Singular Integral Equations of Abel Type
7.4 The Wicksell Solution to the Random Sphere Problem
7.5 The Conditional Expectation Algorithm for the Random Sphere Problem
7.5.1 Density Estimation Issues
7.6 Numerical Examples
7.6.1 Sphere Radius Density $f(r) = 6r(1 - r)$
7.6.2 Sphere Radius Density $f(r) = 6r(1 - r)$: Sampling from $g(x)$
7.6.3 A Real Data Example: Liver Cell Nuclei

8 Conclusions

A A Setup for Numerical Problems
A.1 The General Setup
A.2 A Linear Fredholm Example
A.3 A Linear Volterra Example

B S Code for Conditional Expectation Algorithm and Richardson Algorithm for the Green’s Function Kernel

C S Code for the Trickett-Welch Algorithm and Conditional Expectation Algorithms for the Behrens-Fisher Problem
C.1 The Trickett-Welch Algorithm
C.2 The Conditional Expectation Algorithm

D S Code for the Conditional Expectation Algorithm for the Tolerance Limit Problem

E S Code for Conditional Expectation Algorithm for Random Sphere Problem

Acknowledgements 


I  am  indebted  to  many  individuals  who  helped,  either  directly  or  indirectly,  to  make  this 
thesis possible. Most important of all was the guidance of my advisor, Professor Herman
Chernoff  of  the  Department  of  Statistics,  who  has  provided  encouragement  and  detailed 
criticism  for  almost  four  years.  Like  many  theses,  this  document  is  the  tip  of  an  iceberg. 
Many  ideas  and  applications  which  do  not  appear  in  the  final  document  were  explored  in 
detail,  and  this  required  considerable  time  and  patience  on  the  part  of  my  advisor,  who 
was  working  with  me  in  an  area  not  directly  related  to  his  own  research  interests.  He 
taught  me  a  great  deal  about  posing  questions,  seeking  answers,  and  presenting  results. 
He  worked  with  me  extensively  during  the  last  year,  even  though  he  was  on  sabbatical 
leave, and without this effort on his part I would not have graduated this June. Working
with Professor Chernoff was the most rewarding educational experience that I have had,
and I will always be grateful to him for it.

Professor  Donald  Anderson  of  the  Division  of  Applied  Sciences,  though  formally  a 
second  reader,  voluntarily  took  on  the  unofficial  role  of  ‘second  advisor’.  During  the 
1991-1992  academic  year,  while  Professor  Chernoff  was  on  sabbatical  leave,  I  met  with 
Professor  Anderson  nearly  every  week.  He  took  an  active  interest  in  my  research,  and 
provided  detailed  comments  at  every  step  in  the  final  year  of  thesis  writing.  Through 
his  criticism  of  my  often  clumsy  attempts  to  prove  the  principal  theoretical  results  of 
this thesis, Professor Anderson patiently taught me a considerable amount of functional
analysis  and  linear  algebra,  providing  a  foundation  which  I  hope  to  build  on  in  the  years 
to  come. 

During  the  years  of  research  and  writing,  I  worked  full  time  as  a  statistician  at  the 
Army  Materials  Technology  Laboratory  in  Watertown,  Massachusetts.  Obviously,  I  could 
not  have  both  fulfilled  my  duties  to  this  organization  and  written  a  thesis  simultaneously 
without  considerable  encouragement  and  support  from  individuals  at  the  Laboratory.  I 
would  like  to  single  out  for  special  thanks  among  these  individuals  Donald  Neal,  who 
always  believed  in  what  I  was  capable  of  doing,  even  at  times  when  I  had  serious  doubts, 
and  Colin  Freese,  who  has  been  my  friend  and  mentor  for  over  a  decade. 

I  will  never  forget  the  students  of  the  Statistics  department,  who  together  make  a 
friendly,  supportive  community  within  which  it  was  a  pleasure  to  work  and  learn.  Among 
these students, Pat Meehan, Connie Brown, Tom Blackwell, Chris Schmid, Andrew Gelman,
and Xiao-Li Meng come to mind as friends who always provided enthusiastic support
and encouragement, without which I might not have made it through.

Finally,  I  would  like  to  thank  my  parents,  who  taught  me  throughout  my  life  that  the 
opportunity to learn is a privilege, and that the fulfillment of one’s potential to learn is of
primary  importance  in  life.  This  thesis  is  dedicated  to  them. 


Chapter  1 

Ill-Posed  Integral  Equation 
Problems  in  Statistics 


1.1  Introduction 


Many problems of interest either in mathematical statistics or in applications can be formulated
as integral equations. We are concerned in this dissertation with the common
situation where the integral equation is ill-posed. It is the nature of an ill-posed problem
that slight changes in given functions or data cause large changes in the solution.
Typically, even changes due to discretization or roundoff error in the computer representation
of a function can cause instability when attempts are made to solve the problem
numerically.

The  main  objective  of  this  thesis  is  to  indicate  how  a  simple  iterative  method,  with 
an  appealing  probabilistic  interpretation,  can  be  used  for  the  numerical  solution  of  what 
are  generally  perceived  to  be  difficult  integral  equation  problems. 

Let the (possibly nonlinear) integral equation to be solved be

$$\int_0^1 k\{x, y, f[\phi(x,y)]\}\,dy = g(x), \tag{1.1}$$

where $k$, $\phi$, and $g$ are known functions. The linearization of this integral equation, which
follows from the Fréchet derivative of the nonlinear integral operator, is a linear integral
equation with kernel equal to the derivative of $k$ with respect to its third argument, which
we will denote as $k'(x,y,f)$.
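
To make the linearization concrete, the following is a sketch only; the operator name $A$ and the increment $h$ are notation assumed here for illustration. Writing the left hand side of (1.1) as $A(f)(x)$ and assuming that $k$ is differentiable in its third argument, the Fréchet derivative of $A$ at $f$ acts on an increment $h$ as

```latex
[A'(f)h](x) = \int_0^1 k'\{x, y, f[\phi(x,y)]\}\, h[\phi(x,y)]\,dy ,
\qquad
A'(f)\,h = g - A(f).
```

The second relation is the linear first kind equation that a Newton-type correction $h$ to a current approximation $f$ would have to satisfy.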

Let $f^0$ be a first approximation to a solution of (1.1); often we will choose $f^0 = 0$.
One form of the iteration that we will propose relates $f^{n+1}$ to $f^n$ by

$$f^{n+1} = f^n + \frac{g - \int_0^1 k(x,y,f^n)\,dy}{\int_0^1 k'(x,y,f^n)\,dy}. \tag{1.2}$$
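
As an illustration only, the sketch below applies iteration (1.2) to a discretized linear equation, that is, to the special case $k(x,y,f) = k(x,y)f(y)$ with $\phi(x,y) = y$, for which $k'$ is simply $k(x,y)$. The midpoint quadrature, the grid size, the helper names `beam_kernel` and `iterate_linear`, and the test function $\sin(\pi y)$ are all assumptions made for this example; the kernel is the Green's function used as a model problem in Section 1.3.

```python
import numpy as np

def beam_kernel(x, y):
    """Green's function kernel min(x, y) * (1 - max(x, y)) on the unit square."""
    return np.minimum(x, y) * (1.0 - np.maximum(x, y))

def iterate_linear(K, g, n_iter, f0=None):
    """Apply iteration (1.2) to the discretized linear equation K @ f = g.

    K already includes the quadrature weights, so K @ f approximates the
    integral of k(x, y) f(y) over y, and K.sum(axis=1) approximates the
    integral of k'(x, y, f) = k(x, y) over y.
    """
    f = np.zeros_like(g) if f0 is None else np.array(f0, dtype=float)
    row_sums = K.sum(axis=1)
    for _ in range(n_iter):
        f = f + (g - K @ f) / row_sums
    return f

n = 50
h = 1.0 / n
x = (np.arange(n) + 0.5) * h              # midpoint quadrature nodes (assumed)
K = beam_kernel(x[:, None], x[None, :]) * h

f_true = np.sin(np.pi * x)                # hypothetical "unknown" used to build g
g = K @ f_true                            # consistent right hand side

for n_iter in (1, 10, 100):
    f = iterate_linear(K, g, n_iter)
    print(n_iter, np.max(np.abs(f - f_true)))
```

In this linear case each step divides the residual by the row sums of the discretized kernel, so a constant $f$ is recovered in a single step when starting from $f^0 = 0$.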


An example of a problem which leads to an integral equation of the form (1.1) which
does not have a solution in the usual sense of the word, but which can be easily treated
numerically by the iteration (1.2), is the Behrens-Fisher problem. Trickett and Welch
(1954) apply an iteration, which can be regarded as an approximation to (1.2), to the
nonlinear integral equation formulation of this classical problem with amazingly good
results. For sample sizes $n_1 = n_2 = 20$, Trickett and Welch provide details of hand
calculations of five iterations which lead to a smooth critical value statistic which provides
a test differing from the nominal size by no more than ±0.000002, regardless of the value
of the variance ratio. We will discuss the Behrens-Fisher problem from the point of view
of integral equations in Chapter 5.

Marie  and  Graybill  (1979),  independently  of  Trickett  and  Welch  (1954),  applied  the 
same  algorithm  to  a  variant  of  the  Behrens-Fisher  problem.  Wang  (1989),  using  the 
Trickett-Welch approach, iteratively solved a β-expectation tolerance limit problem for
a normal random-effects model. In Chapter 6, we discuss the solution of Vangel (1987,
1990, 1992) to a normal random effects model β-content tolerance limit problem, a problem
which  can  also  be  formulated  as  a  nonlinear  integral  equation. 

In all four of these cases, the authors use iterative algorithms to ‘solve’ nonlinear ill-posed
problems numerically, problems which have long been known to most likely possess
either  no  solutions,  or  else  only  pathological  solutions  (Linnik,  1968).  It  is  also  significant 
that  in  none  of  the  above  articles  is  there  a  single  mention  of  the  ill-posed  nature  of  the 
problems  being  treated  numerically. 

On  the  other  hand,  in  the  current  literature  on  ill-posed  integral  equation  problems 
iterative algorithms are scarcely mentioned. Regularization methods dominate this landscape.
The usual regularization methods (see, e.g., Tikhonov and Arsenin, 1977) introduce
a  penalty  term  which  causes  a  solution  to  be  more  or  less  smooth  depending  on  the  value 
of  a  parameter.  Since  any  linear  smoother  solves  a  certain  penalized  least  squares  problem 
(Hastie  and  Tibshirani,  1990,  p.72),  there  is  regularization  implicit  in  using  an  iterative 
method  on  a  problem  in  which  the  kernel  acts  as  a  smoother,  a  point  which  we  will  take 
up  in  Chapter  3. 

In  addition  to  interesting  problems  in  mathematical  statistics,  the  methods  of  this 
thesis  may  prove  useful  in  the  solution  of  many  ill-posed  problems  in  applied  statistics. 
Although our emphasis will be on problems for which $g$ in (1.1) is known without error,
we  will  also  consider,  in  Chapter  7,  a  classical  inverse  problem  of  stereology  where  g  is 
either  a  function  observed  with  error,  or  else  an  estimate  of  a  probability  density. 

1.2  The  Ill-Posed  Nature  of  Integral  Equations  of  the  First 
Kind 

In  this  section,  we  introduce  some  terminology  from  the  theory  of  integral  equations,  and 
we  discuss  the  concept  of  an  ill-posed  problem.  A  review  of  the  classical  theory  of  integral 
equations  of  the  first  kind  along  with  a  discussion  of  the  ill-posedness  of  these  integral 
equations  appears  in  Chapter  2. 

1.2.1 Classification of Integral Equations

A linear Fredholm integral equation of the first kind is an equation of the form

$$\int_0^1 k(x,y) f(y)\,dy = g(x), \tag{1.3}$$

where $k$, $f$, and $g$ are functions in $L_2$. The function $k(x,y)$ is called the kernel of the integral
equation. For nonlinear integral equations (e.g., (1.1)), the kernel is also a function of the
unknown $f$. All of the nonlinear examples which we will consider are special cases of the
equation

$$\int_0^1 k\{x, y, f[\phi(x,y)]\}\,dy = g(x), \tag{1.4}$$

where $k$, $g$, and $\phi$ are known functions.

The general linear Fredholm equation of the second kind is

$$g(x) + \lambda \int_0^1 k(x,y) f(y)\,dy = f(x), \tag{1.5}$$

where $\lambda$ is a constant. The methods to be discussed in this thesis are also applicable to
the second kind equation (1.5). However, we will not consider the second kind equation
further since it is generally well posed (see section 1.2.2 below) and more efficiently solved
by methods which exploit the special structure of second kind equations (the classical
Fredholm theorems, see, e.g., Smithies, 1958).

If the upper limits in the above integrals are replaced by $x$, then these equations
become equations of the Volterra type. A linear Volterra equation of the first kind

$$\int_0^x k(x,y) f(y)\,dy = g(x), \tag{1.6}$$

can be regarded as a Fredholm equation with the kernel

$$k^{*}(x,y) = \begin{cases} k(x,y) & \text{if } y \le x \\ 0 & \text{if } y > x. \end{cases} \tag{1.7}$$

Alternatively, with the change of variable $y = xw$, (1.6) becomes

$$x \int_0^1 k(x,xw) f(xw)\,dw = g(x), \tag{1.8}$$

an equation with constant limits of integration.
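
The change of variable can be checked numerically. The sketch below is only an illustration: the kernel $e^{-(x-y)}$, the function $f(y) = y^2$, the evaluation point, and the midpoint rule are assumptions chosen because the Volterra integral then has a simple closed form.

```python
import numpy as np

def k(x, y):
    """Hypothetical smooth kernel, chosen only for illustration."""
    return np.exp(-(x - y))

def f(y):
    """Hypothetical function standing in for the unknown."""
    return y ** 2

def volterra_direct(x, n=2000):
    """Midpoint rule for the integral of k(x, y) f(y) over 0 <= y <= x."""
    y = (np.arange(n) + 0.5) * (x / n)
    return np.sum(k(x, y) * f(y)) * (x / n)

def volterra_rescaled(x, n=2000):
    """Midpoint rule for x times the integral of k(x, x w) f(x w) over 0 <= w <= 1."""
    w = (np.arange(n) + 0.5) / n
    return x * np.sum(k(x, x * w) * f(x * w)) / n

x0 = 0.7
# Closed form of the integral for this particular k and f.
exact = (x0 ** 2 - 2 * x0 + 2) - 2 * np.exp(-x0)
print(volterra_direct(x0), volterra_rescaled(x0), exact)
```

Both quadratures use the same nodes after rescaling, so they agree to rounding error, and both approximate the closed form.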

1.2.2  Ill-Posed  Problems 

Hadamard  originated  the  classification  of  inverse  problems  as  well-  and  ill-posed;  a  general 
discussion  appears  in  Tikhonov  and  Arsenin  (1977,  pp.  7-8).  We  consider  here  the 
nonlinear operator equation $A(f) = g$, where $f$ is to be found in terms of given data $g$. If
$A(f) = g$ for some function $f$, then we write $f = A^{-}(g)$. The problem of determining $f$
is well posed if the following three conditions are satisfied:

1. For every $g$ there exists a solution $f$,

2. this solution is unique, and

3. the inverse operator $A^{-}$ is continuous.

Problems  which  do  not  satisfy  all  of  these  conditions  (particularly  condition  (3))  are  said 
to  be  ill-posed. 


1.2.3  Near  Solutions  and  Near  Convergence  of  Ill-Posed  Integral  Equations 
of  the  First  Kind 

When  treating  an  ill-posed  integral  equation  of  the  first  kind  numerically,  we  are  usually 
not  interested  in  obtaining  an  exact  ‘solution’,  because  a  solution  which  corresponds  to 
exactly  the  right  hand  side  in  a  numerical  representation  of  the  integral  equation  can  be 
very different from a solution to the original functional equation. The reason for this is
that a representation of the right hand side on a computer will always differ (because
of  discretization  error,  roundoff  error,  and  possibly  noise)  from  the  true  function  g.  We 
therefore  introduce  the  notion  of  a  near-solution  for  a  smooth,  well  behaved  function  which 
results  in  a  right  hand  side  close  to  the  actual  right  hand  side.  An  iterative  algorithm 
which  results  in  near-solutions  after  a  moderate  number  of  iterations  will  sometimes  be 
referred  to  as  nearly  convergent.  The  iterative  algorithms  discussed  in  this  thesis  can 
produce  near-solutions  in  practice,  even  when  the  matrix  discretization,  or  perhaps  the 
original  integral  equation,  has  no  solution.  In  practice,  one  stops  after  at  most  a  few  dozen 
iterations. In theory, one considers infinitely many iterations, and the nearly convergent
algorithm will either converge, possibly to an exact solution (which is likely not to be
smooth), or else, in the inconsistent case, diverge.


1.3  An  Example:  Deflection  of  a  Simply  Supported  Beam 

We present the following example both to illustrate the ill-posed nature of the integral
equation  of  the  first  kind  and  to  introduce  a  very  simple  integral  equation  which  will  be 
referred  to  repeatedly  in  later  chapters  as  a  model  problem.  The  problem  introduced  here 
is  widely  used  as  an  example  in  the  literature  on  numerical  methods  for  integral  equations 
of  the  first  kind. 

Consider  a  thin,  elastic  beam  of  unit  length  ‘hinged’  at  the  ends  so  that  bending 
moments  cannot  be  transmitted  from  the  supports.  Let  a  continuous  force  be  applied 
perpendicular  to  the  beam,  and  let  this  force  as  a  function  of  position  be  denoted  f(x). 
The relationship between $f$ and the displacement $g$ that it causes is, for an appropriate
choice of material constants,

$$\int_0^1 k(x,y) f(y)\,dy = g(x), \tag{1.9}$$

where the kernel $k$ (a Green’s function) is

$$k(x,y) = \begin{cases} y(1 - x) & \text{if } y \le x \\ x(1 - y) & \text{if } y > x. \end{cases} \tag{1.10}$$

The integral equation (1.9) is equivalent to the boundary value problem

$$\frac{d^2 g}{dx^2} + f(x) = 0, \qquad g(0) = g(1) = 0, \tag{1.11}$$

and its solution is

$$f(x) = -g''(x) \tag{1.12}$$

(see, e.g., Tricomi, 1957, pp. 116-117).

We consider here right hand sides of the form

$$g_l(x) = g_0(x)\,(1 + \sin(\pi l x)/l). \tag{1.13}$$

The $L_2$ norm of $g_l$ satisfies

$$\|g_l\| \le M(1 + O(1/l)), \tag{1.14}$$

so by making $l$ large enough we have, for arbitrary positive $\epsilon$, that

$$\big|\,\|g_l\| - \|g_0\|\,\big| < \epsilon. \tag{1.15}$$

The solution to (1.9), however, is

$$f_l(x) = -g_0''(x)\,[1 + \sin(\pi l x)/l] + g_0(x)\,\pi^2 l \sin(\pi l x) - 2\pi g_0'(x)\cos(\pi l x), \tag{1.16}$$

and as $l \to \infty$, the difference in norms $\big|\,\|f_l\| - \|f_0\|\,\big|$ is unbounded.
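
The instability can be seen numerically. The following sketch is an illustration under stated assumptions: the unperturbed displacement $g_0(x) = x(1-x)$ is a hypothetical choice that satisfies the boundary conditions, norms are taken in $L_2$ and approximated by a midpoint rule, and the helper names are invented for this example. It evaluates the perturbed right hand side $g_l$ and the corresponding $f_l$ of (1.16) for a few values of $l$.

```python
import numpy as np

def l2_norm(values, h):
    """Midpoint-rule approximation to the L2 norm on [0, 1]."""
    return np.sqrt(np.sum(values ** 2) * h)

n = 20000
h = 1.0 / n
x = (np.arange(n) + 0.5) * h

# Hypothetical smooth displacement with g0(0) = g0(1) = 0, and its derivatives.
g0 = x * (1.0 - x)
dg0 = 1.0 - 2.0 * x
d2g0 = -2.0 * np.ones_like(x)
f0 = -d2g0                                  # load that produces g0

def perturbed(l):
    """Return the perturbed right hand side g_l and the f_l of (1.16) on the grid."""
    s, c = np.sin(np.pi * l * x), np.cos(np.pi * l * x)
    g_l = g0 * (1.0 + s / l)
    f_l = -d2g0 * (1.0 + s / l) + g0 * np.pi ** 2 * l * s - 2.0 * np.pi * dg0 * c
    return g_l, f_l

def norm_gaps(l):
    """Differences in norms between perturbed and unperturbed data and solutions."""
    g_l, f_l = perturbed(l)
    gap_g = abs(l2_norm(g_l, h) - l2_norm(g0, h))
    gap_f = abs(l2_norm(f_l, h) - l2_norm(f0, h))
    return gap_g, gap_f

for l in (1, 10, 100):
    print(l, norm_gaps(l))
```

For this choice of $g_0$ the gap in the data norms shrinks as $l$ grows while the gap in the solution norms grows roughly in proportion to $l$.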

1.4  Integral  Equations  of  the  First  Kind  in  Statistics 

There  are  several  sources  of  integral  equations  of  the  first  kind  in  statistics.  These  include 
problems  of 

1.  unbiased  estimation, 

2.  estimating  a  prior  distribution  on  a  parameter  given  the  marginal  distribution  of 
the  data  and  the  likelihood, 

3.  similar tests for normal theory problems, and

4.  inverse  problems  of  indirect  measurement. 

We  will  introduce  1)  and  3)  in  Chapter  2,  with  detailed  discussion  of  particular  examples 
of 3) (the Behrens-Fisher problem and a tolerance limit problem) to follow in Chapters 5
and  6.  An  inverse  problem  of  stereology  provides  an  example  of  4)  which  we  will  consider 
in  Chapter  7.  The  empirical  Bayes  problem  of  estimating  a  prior  distribution  is  formally 
very  much  like  1),  and  we  will  not  discuss  this  problem  in  this  thesis. 


1.5  An  Outline  of  the  Remaining  Chapters 

Chapter  2  consists  of  review  material  from  linear  algebra,  matrix  analysis,  functional 
analysis,  probability,  statistics,  and  the  theory  of  linear  operator  equations  of  the  first 
kind.  Most  readers  will  find  some  of  this  material  helpful,  although  probably  no  one  will 
find  all  of  this  material  new.  Because  this  thesis  is  partly  numerical  analysis  and  partly 
statistics,  it  is  necessary  to  consider  readers  from  each  of  these  fields  who  might  not  have 
a  strong  background  in  the  other  discipline. 

Chapter  3  contains  most  of  the  theoretical  discussion  of  this  thesis.  We  begin  by 
introducing  the  Richardson  and  preconditioned  Richardson  iterative  algorithms  for  linear 
operator  equations  of  the  first  kind.  We  then  briefly  review  the  literature  on  convergence 
of some basic iterative algorithms in $L_2$.

This  thesis  is  concerned  almost  exclusively  with  matrix  equations  which  arise  from  the 
discretization  of  integral  equations.  Since  the  integral  equations  which  we  shall  consider 
are ill-posed, the matrix equations which result from discretizations will usually be numerically
singular, and often also inconsistent. Even though, because of roundoff error,
the discretizations will almost never be exactly singular, the study of the singular case
helps throw light on the situation where one has almost singular matrices, and on the
original problem in function space, where one can have nonuniqueness or inconsistency.
In Chapter 3, we establish a sufficient condition for convergence of Richardson’s algorithm
for  consistent  matrix  equations,  and  we  also  prove  that  the  conditions  are  necessary  for 
an  important  class  of  problems.  We  also  discuss  the  inconsistent  case  qualitatively,  and 
argue  that  the  proposed  algorithms  are  robust  with  respect  to  moderate  violation  of  the 
consistency  assumption. 
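
How conditioning deteriorates under refinement can be sketched for the Green's function kernel (1.10). The midpoint discretization, the grid sizes, and the helper name `beam_matrix` are assumptions of this example, not the setup of the appendix. This particular kernel is only mildly ill-conditioned, with a condition number growing roughly like $n^2$; smoother kernels can be expected to behave much worse.

```python
import numpy as np

def beam_matrix(n):
    """Midpoint-rule discretization of the Green's function kernel (1.10)."""
    h = 1.0 / n
    x = (np.arange(n) + 0.5) * h
    xi, yj = np.meshgrid(x, x, indexing="ij")
    return np.minimum(xi, yj) * (1.0 - np.maximum(xi, yj)) * h

# 2-norm condition number of the discretized operator for several grid sizes.
conds = {n: np.linalg.cond(beam_matrix(n)) for n in (10, 20, 40, 80)}
for n, c in conds.items():
    print(n, c)
```

Each doubling of the number of quadrature nodes increases the condition number by roughly a factor of four, so a finer discretization amplifies errors in the right hand side more strongly.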

The  proposed  iterative  algorithms  tend  to  produce  smooth  approximate  solutions; 
hence  there  is  regularization  implicit  in  using  these  iterative  methods.  In  Chapter  3,  we 
introduce  the  notion  of  iterative  regularization  and  relate  it  to  penalized  least  squares. 

Although Richardson’s algorithm tends to produce smooth near-solutions in many
situations,  this  algorithm  can  converge  very  slowly.  The  objective  of  preconditioning  is 
to  produce  a  modified  algorithm  which  converges  more  rapidly.  We  examine  a  form  of 
preconditioning  of  the  Richardson  iterates  for  matrix  equations  with  positive  matrices. 
This  preconditioning  consists  of  operating  on  both  sides  of  the  equation  on  the  left  so  as 
to  make  the  matrix  stochastic,  hence  the  name  stochastic  preconditioning.  We  use  the 
Perron-Frobenius theory of positive matrices to suggest under what conditions our proposed
preconditioning can be expected to work well. Several heuristic motivations are also
provided; one probabilistic motivation leads to the suggested name Conditional Expectation
Algorithm for the proposed preconditioned Richardson algorithm and its nonlinear
generalizations. 

Chapter  3  concludes  with  linear  Fredholm  and  Volterra  examples.  A  careful  discussion 
of  the  discretization  process  is  given  in  an  appendix,  so  that  all  of  the  numerical  examples 
in  this  thesis  can  be  readily  duplicated  and  extended  by  the  interested  reader. 

In  Chapter  4,  we  consider  nonlinear  equations.  This  is  a  short  chapter  which  serves 
mostly  to  establish  notation  and  to  generalize  the  Conditional  Expectation  algorithm, 
introduced  only  for  linear  problems  in  the  previous  chapter,  to  nonlinear  integral  equations 
of  the  first  kind. 

In  Chapter  5,  we  begin  the  discussion  of  applications  to  statistics  by  reviewing  the 
Behrens-Fisher problem, with an emphasis on the Trickett and Welch (1954) solution.
Most  of  the  results  of  this  chapter  are  not  new,  but  the  perspective  on  the  problem  is. 
We  are  as  concerned  with  the  method  of  solution  as  with  the  results.  Also,  unlike  Trickett 
and  Welch,  we  are  aware  of  Linnik’s  (1968)  demonstration  that  only  pathological  exact 
solutions  exist.  The  algorithm  which  Trickett  and  Welch  use,  with  much  success,  can 
be  regarded  as  a  very  good  approximation  to  a  Conditional  Expectation  algorithm.  In 
fact, the differences between the iterates produced by the Trickett-Welch and Conditional
Expectation algorithms are negligible. However, the Conditional Expectation algorithm
can work in situations where the Trickett-Welch approach is not useful, as we show in
Chapter  6. 

In Chapter 6, we discuss one-sided $\beta$-content tolerance limits for a normal population
with two components of variance estimated by data from a one-way balanced random-effects
ANOVA model. By numerically approximating the solution to a nonlinear integral
equation using a Conditional Expectation algorithm, we develop a tolerance limit procedure
which provides the appropriate confidence level almost independently of the unknown
ratio of within- to between-group variances. It is very likely the case that, as with the
Behrens-Fisher problem, this tolerance limit problem has either no exact solutions or only
pathological ones. However, by numerically ‘solving’ an integral equation of the first
kind, using the Conditional Expectation algorithm, we obtain near solutions and are able
to develop a method which represents a substantial improvement over the Mee-Owen
(1983) approach, which is the only competing procedure in the statistics literature. For
ease of computation, we provide coefficients for polynomials fit to the integral equation
solutions for two important cases. We also suggest another very simple alternative to the
Mee-Owen method.

In Chapter 7, we discuss an interesting example of an ill-posed inverse problem. Consider
a two-phase medium where the first phase consists of spherical inclusions of random
radius  randomly  distributed  in  a  second  phase.  The  radii  of  these  spheres  are  assumed  to 
follow  a  probability  distribution  which  has  a  density,  and  we  would  like  to  estimate  this 
density.  The  available  data  are  circle  radii  measured  on  cross-sections  of  the  material. 
The  density  of  the  circle  radii  is  related  to  the  density  of  the  sphere  radii  by  an  Abel 
integral  equation  of  the  first  kind.  This  problem  of  indirect  measurement  is  typical  of 
the  inverse  problems  of  stereology,  the  science  of  inferring  higher  dimensional  structure 
from  lower  dimensional  data.  In  this  chapter,  we  derive  the  Abel  equation  (first  reported 
in  Wicksell  (1925))  and  briefly  review  the  extensive  literature  on  this  problem.  We  then 
proceed  to  apply  the  Conditional  Expectation  algorithm  in  order  to  develop  a  method  for 
solving  this  equation.  This  apparently  new  approach  is  demonstrated  on  both  simulated 
and  real  data.  For  the  simulated  data,  we  consider  both  the  case  where  the  density  of 
the  circle  radii  is  a  function  observed  with  noise,  and  where  the  circle  radius  density  is 
estimated by a sample from this probability density.

Chapter 2


A  Review  of  Background 
Material  from  Linear  Algebra, 
Functional  Analysis,  Probability, 
and  Statistics 


2.1  Matrix  Algebra 

We  review  here  those  concepts  from  matrix  algebra  which  will  be  used  in  this  thesis.  We 
assume  familiarity  with  topics  generally  covered  in  a  first  course  in  this  subject,  although 
we  will  briefly  review  some  of  these  ideas  (eigenvalue,  similarity,  etc.)  for  completeness. 
The  definition  of  a  vector  space,  and  a  discussion  of  the  important  notions  of  range  and 
nullspace  are  deferred  until  Section  2.3,  where  we  take  these  topics  up  in  a  more  general 
Hilbert  space  setting. 

In this thesis, we will denote m-dimensional complex Euclidean space by $C^m$, and
m-dimensional real Euclidean space by $\mathcal{R}^m$.

2.1.1  Elementary  Notions 

Let A be an arbitrary $m \times n$ matrix with elements $a_{ij} \in C$. The entry in the ith row
and jth column of A is $a_{ij}$, and we write this as $A_{ij} = a_{ij}$. The transpose of A, $A^T$, has
typical element $A^T_{ij} = a_{ji}$, and the adjoint of A, $A^*$, has typical element $A^*_{ij} = \bar{a}_{ji}$, where
the overbar denotes complex conjugation. If $A = A^T$, then A is symmetric; if $A = A^*$,
then A is Hermitian. The rank of a matrix A is the number of linearly independent rows,
which equals the number of linearly independent columns.

A scalar $\lambda$ is an eigenvalue, and a nonzero vector x is a corresponding eigenvector, of
a square matrix A if

$Ax = \lambda x$.  (2.1)

If A is Hermitian, then $\lambda$ is real. If A is Hermitian, and for all $x \in C^m$, $x \neq 0$,

$x^* A x \geq 0$,  (2.2)

then A is positive semi-definite, and all eigenvalues of A are nonnegative. If the inequality
in (2.2) is strict, then A is positive definite and all of the eigenvalues of A are positive.


The eigenvalues of a square matrix A are the roots of the characteristic polynomial

$d(\lambda) = |A - \lambda I|$,  (2.3)

where $|\cdot|$ denotes the determinant. If zero is an eigenvalue, then $|A| = 0$ and the matrix
A is said to be singular; otherwise A is nonsingular. The multiplicity of an eigenvalue as
a root of $d(\lambda)$ is called the algebraic multiplicity of the eigenvalue. The dimension of the
subspace of eigenvectors corresponding to an eigenvalue is called the geometric multiplicity
of the eigenvalue. The geometric multiplicity of an eigenvalue is always less than or equal
to its algebraic multiplicity.

Matrices  for  which  the  algebraic  and  geometric  multiplicities  of  at  least  one  eigenvalue 
are not equal are said to be defective, or non-diagonalizable. When the multiplicity of
an  eigenvalue  is  referred  to  without  a  modifier,  algebraic  multiplicity  is  implied.  When 
we  refer  to  a  set  of  eigenvalues,  or  to  the  cardinality  of  such  a  set,  without  explicitly 
stating  that  we  mean  distinct  eigenvalues,  then  it  is  to  be  understood  that  we  have  in 
mind  eigenvalues  repeated  according  to  their  algebraic  multiplicities.  Sometimes  we  will 
state  this  idea  briefly  by  using  the  phrase  ‘counting  multiplicities’. 
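The distinction between the two multiplicities can be checked on small examples. The following is a minimal sketch, not part of the original text: the helper names `rank` and `geometric_multiplicity` are hypothetical, and the geometric multiplicity is computed as m minus the rank of $A - \lambda I$ using exact rational arithmetic.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(rows[0])):
        # Find a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def geometric_multiplicity(A, lam):
    """Dimension of the nullspace of A - lam*I, computed as m - rank(A - lam*I)."""
    m = len(A)
    shifted = [[A[i][j] - (lam if i == j else 0) for j in range(m)] for i in range(m)]
    return m - rank(shifted)

# Both matrices have the eigenvalue 1 with algebraic multiplicity 2.
defective = [[1, 1], [0, 1]]
diagonalizable = [[1, 0], [0, 1]]
```

Under these assumptions, `geometric_multiplicity(defective, 1)` is 1 while `geometric_multiplicity(diagonalizable, 1)` is 2, so only the first matrix is defective.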

Two  square  matrices  which  represent  the  same  linear  transformation,  possibly  with 
respect  to  different  bases,  are  said  to  be  similar.  In  particular,  similar  matrices  have  the 
same  eigenvalues.  We  state  this  formally  as 

Definition 2.1.1 (Similarity) Two $m \times m$ matrices A and B are said to be similar if
there exists a nonsingular matrix S such that

$A = S^{-1} B S$.

If A is Hermitian, then A is similar to a diagonal matrix (which must have the eigenvalues
of A as its diagonal elements). There is a matrix S which provides the similarity
transformation and is unitary, i.e. $S^{-1} = S^*$. (A matrix U for which $U^*U = I$ is said to
be unitary; if $U^T U = I$, then U is orthogonal.) The following result is the spectral theorem
for Hermitian matrices:

Theorem 2.1.1 (Spectral theorem for Hermitian matrices) Let A be a Hermitian
matrix. Then there is a matrix U such that $U^*U = I$ and

$A = U \Lambda U^*$,

where $\Lambda$ is a real diagonal matrix whose diagonal elements are eigenvalues of A, and where
the columns of U are corresponding eigenvectors.

If A is Hermitian, then there is a set of orthonormal eigenvectors (i.e. eigenvectors $\{u_i\}$
for which $u_i^* u_j$ equals one if i = j, and zero otherwise); the columns of U form one such
set. In general, the matrix U of the theorem is not unique. If A is real and symmetric,
then U can be selected to be real also.

2.1.2  The  Singular  Value  Decomposition 

Assume that A is an $m \times n$ matrix with $m \leq n$, and that the rank of A is $q \leq m$. Then
$A^*A$ and $AA^*$ are Hermitian, positive semidefinite matrices. The eigenvalues of $AA^*$,
which we denote $\sigma_i^2$, where

$\sigma_1^2 \geq \sigma_2^2 \geq \cdots \geq \sigma_m^2 \geq 0$,  (2.4)

are also eigenvalues of $A^*A$. If n > m, then the $n \times n$ matrix $A^*A$ has n - m additional
eigenvalues which equal zero. The nonnegative numbers $\{\sigma_i\}_{i=1}^{m}$ are called singular values,
and we have the following result, which is, in a sense, one possible extension of Theorem
2.1.1 to general matrices (Horn and Johnson, 1985, p. 414):

Theorem 2.1.2 (Singular Value Decomposition) Let A be an arbitrary $m \times n$ matrix,
with $m \leq n$, and let the rank of A be $q \leq m$. Then, there exist unitary matrices U
and V, where U is $m \times m$ and V is $n \times n$, and an $m \times n$ diagonal matrix $\Sigma$, such that

$A = U \Sigma V^*$.

The m diagonal elements of $\Sigma$ are the singular values of A, denoted $\{\sigma_i\}_{i=1}^{m}$, where

$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_q > 0 = \sigma_{q+1} = \cdots = \sigma_m$.

The columns of U and V are called left and right singular vectors, respectively, of A.
The columns of U are eigenvectors of $AA^*$, and the columns of V are eigenvectors of $A^*A$
(arranged in the same order as the corresponding eigenvalues $\sigma_i^2$).

Here again, if A is real, we can find real orthogonal matrices U and V. If A is a square
matrix with eigenvalues $\lambda_i$ and singular values $\sigma_i$, then

$\max_i \sigma_i \geq \max_i |\lambda_i|$.  (2.5)
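Inequality (2.5) can be strict by a wide margin when A is far from Hermitian. The following minimal sketch is an illustration only: the helper names are hypothetical, and the closed-form eigenvalue formula is specific to real $2 \times 2$ matrices. It computes singular values as square roots of the eigenvalues of $A^T A$.

```python
import math

def sym2_eigenvalues(a, b, d):
    """Eigenvalues (largest first) of the real symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius

def singular_values_2x2(A):
    """Singular values of a real 2 x 2 matrix, from the eigenvalues of A^T A."""
    (p, q), (r, s) = A
    # Entries of the symmetric matrix A^T A.
    a = p * p + r * r
    b = p * q + r * s
    d = q * q + s * s
    big, small = sym2_eigenvalues(a, b, d)
    return math.sqrt(big), math.sqrt(max(small, 0.0))

# Upper triangular example: both eigenvalues equal 1.
A = [[1.0, 4.0], [0.0, 1.0]]
sigma_1, sigma_2 = singular_values_2x2(A)
```

For this example the largest singular value is $2 + \sqrt{5}$, well above the largest eigenvalue modulus of 1, while the product of the two singular values equals $|\det A| = 1$.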


2.1.3  The  Jordan  Canonical  Form 

Not  all  matrices  are  diagonalizable,  that  is,  similar  to  a  diagonal  matrix.  Any  square 
matrix,  however,  is  similar  to  a  matrix  which  is  nearly  diagonal,  and  this  near-diagonal 
representation  is  referred  to  as  the  Jordan  Canonical  Form  (Horn  and  Johnson,  1985, 
Chapter  3). 

Let A be an arbitrary $m \times m$ matrix of rank $q \leq m$. There exists a nonsingular matrix
S such that

$J = S^{-1} A S$,  (2.6)

where

$J = \mathrm{diag}(J_1, \ldots, J_r)$,  (2.7)

a Jordan form matrix, is block diagonal, with $r \leq m$ blocks. Each Jordan block $J_i$ has
an eigenvalue of A, $\lambda_i$, on its main diagonal, ones on the diagonal for which the column
index is one greater than the row index, and zeros everywhere else. For example, if $J_i$
happens to be $4 \times 4$, then it will be a matrix of the form

$J_i(\lambda_i) = \begin{bmatrix} \lambda_i & 1 & 0 & 0 \\ 0 & \lambda_i & 1 & 0 \\ 0 & 0 & \lambda_i & 1 \\ 0 & 0 & 0 & \lambda_i \end{bmatrix}$.  (2.8)


If the dimension of $J_i$ is $n_i \times n_i$, then $\sum_{i=1}^{r} n_i = m$, where the sum of the geometric
multiplicities of distinct eigenvalues is r. The Jordan form exists for any square matrix,
and it is, except for permutations of rows and columns, unique.

2.1.4  Matrices Having a Diagonalizable Nullspace

Let A be an $m \times m$ matrix of rank q. If A is nonsingular, then q = m and there exists a
nonsingular matrix B and a nonsingular Jordan form matrix J such that $A = B^{-1} J B$. If
A is singular, then q < m and there exists a nonsingular matrix B such that

$A = B^{-1} \begin{bmatrix} J_{s \times s} & 0_{s \times (m-s)} \\ 0_{(m-s) \times s} & N_{(m-s) \times (m-s)} \end{bmatrix} B$,  (2.9)

where: $s \leq q$, J is a nonsingular $s \times s$ Jordan form matrix, and N is a matrix of Jordan
blocks corresponding to a zero eigenvalue. The rank of J, s, is equal to the rank of A, q, if
and only if N = 0. If $N \neq 0$, then s < q, since the nonzero rows of $BAB^{-1}$ corresponding
to rows of N are each linearly independent of the rows of $BAB^{-1}$ corresponding to rows
of J.

Consider the submatrix N in (2.9), and let the typical element of this matrix be
denoted $n_{ij}$. For $l = 0, \ldots, m - s - 1$, define the lth super-diagonal to be the set of entries
$s_l = \{ n_{i,i+l} : i = 1, \ldots, m - s - l \}$. The only nonzero elements of N are on the first
super-diagonal, and these values equal one. It is easy to show by direct calculation that
any nonzero elements of $N^2$ must be ones on the second super-diagonal. To see this,
compute the square of any Jordan block corresponding to a zero eigenvalue, for example
(2.8) with $\lambda_i = 0$. Similarly, $N^l$, for $l < m - s$, must be zero everywhere except possibly
on the lth super-diagonal. It follows that $N^l = 0$ for all $l \geq m - s$.

A matrix which when raised to some power is equal to a zero matrix is said to be
nilpotent, which is the reason why the letter ‘N’ is used in (2.9). The smallest positive
integer i such that $N^i = 0$ is called the index of both the matrix N and the matrix A in
the Jordan form representation (2.9). If a matrix A is nonsingular, we define it to have
index i = 0.
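The direct calculation described above is easy to reproduce. The sketch below is an illustration with hypothetical helper names: it builds the $4 \times 4$ Jordan block (2.8) with $\lambda_i = 0$, squares it, and searches for the index by repeated multiplication.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def jordan_block_zero(n):
    """n x n Jordan block for eigenvalue zero: ones on the first super-diagonal."""
    return [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n)]

def nilpotent_index(N):
    """Smallest positive integer i with N^i = 0, or None if N^n is nonzero."""
    n = len(N)
    P = N  # holds N^i at the start of iteration i
    for i in range(1, n + 1):
        if all(v == 0 for row in P for v in row):
            return i
        P = matmul(P, N)
    return None

N = jordan_block_zero(4)
N2 = matmul(N, N)
```

Here `N2` has ones only on the second super-diagonal, and `nilpotent_index(N)` returns 4, the dimension of the block.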

When N in (2.9) is a zero matrix, then the geometric multiplicity of zero as an eigenvalue
of a singular matrix equals its algebraic multiplicity. If N has a nonzero block, then
this is no longer the case. We will refer to the class of singular matrices for which

$A = B^{-1} \begin{bmatrix} J_{q \times q} & 0_{q \times (m-q)} \\ 0_{(m-q) \times q} & 0_{(m-q) \times (m-q)} \end{bmatrix} B$,  (2.10)

for nonsingular J and B, as matrices having a diagonalizable nullspace. We are introducing
this nonstandard terminology in this thesis, since, for our purposes, it is more suggestive
than the usual definition: i.e. that a matrix A is of the form (2.10) if and only if A has a
group inverse (Campbell and Meyer, Chapter 7). However, it will be convenient to express
certain results in terms of the group inverse of a matrix, so we define this concept next.


Definition 2.1.2 (Group Inverse) Let A be an arbitrary square matrix. A generalized
inverse matrix, $A^{\#}$, such that

1. $A^{\#} A A^{\#} = A^{\#}$,

2. $A A^{\#} A = A$, and

3. $A A^{\#} = A^{\#} A$

is called the group inverse of the matrix A.

If $A^{\#}$ exists, then it is unique. If A is singular and $A^{\#}$ exists, then A must be of the
form (2.10). We can see by direct calculation that

$A^{\#} = B^{-1} \begin{bmatrix} J^{-1}_{q \times q} & 0_{q \times (m-q)} \\ 0_{(m-q) \times q} & 0_{(m-q) \times (m-q)} \end{bmatrix} B$  (2.11)

is the group inverse of A. For a detailed discussion of the properties of the group inverse,
see Chapter 7 of Campbell and Meyer (1979).
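The direct calculation behind (2.11) can be checked exactly on a small case. The following sketch is illustrative only: the matrices B and J are hypothetical choices (m = 2, q = 1, J = [2]), and `is_group_inverse` simply tests the three conditions of Definition 2.1.2 in rational arithmetic.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def is_group_inverse(A, G):
    """Check G A G = G, A G A = A, and A G = G A exactly."""
    AG, GA = matmul(A, G), matmul(G, A)
    return matmul(G, AG) == G and matmul(AG, A) == A and AG == GA

# Hypothetical example of the form (2.10) with q = 1 and J = [2].
B = [[F(1), F(1)], [F(0), F(1)]]
B_inv = [[F(1), F(-1)], [F(0), F(1)]]
A = matmul(matmul(B_inv, [[F(2), F(0)], [F(0), F(0)]]), B)

# Candidate group inverse built as in (2.11), with J replaced by its inverse.
G = matmul(matmul(B_inv, [[F(1, 2), F(0)], [F(0), F(0)]]), B)
```

With these choices A is the singular matrix with rows (2, 2) and (0, 0), G has rows (1/2, 1/2) and (0, 0), and `is_group_inverse(A, G)` holds.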


2.1.5  Congruence 

Definition 2.1.3 (Congruence) A square matrix B is said to be congruent to a matrix
A if there exists a nonsingular matrix S such that

$B = S A S^*$.


It  is  easy  to  show  that  the  properties  of  being  positive  definite  and  positive  semi-definite 
are  preserved  by  a  congruence  transformation. 

Lemma 2.1.1 Let A be positive semi-definite, and let B be congruent to A. Then B is
positive semi-definite. If A is positive definite, then B is also.

Proof: For some nonsingular matrix S, and any nonzero vector x, we have that

$x^* B x = x^* S A S^* x = (S^* x)^* A (S^* x) = y^* A y \geq 0$,  (2.12)

since A is positive semi-definite by hypothesis. If A is positive definite and $x \neq 0$, then
$y = S^* x$ is not zero, the inequality (2.12) is strict, and hence B is positive definite. ∎

2.1.6  Nonnegative  Matrices 

There is an extensive theory for matrices having nonnegative elements (e.g., Horn and
Johnson, 1985, chapter 9). A matrix A is said to be positive, and we write A > 0, if all of
the elements of A are strictly positive. Similarly, if A has only nonnegative elements, we
say that A is nonnegative, and we write $A \geq 0$. The fundamental theorem in the theory
of nonnegative matrices is the Perron-Frobenius theorem, a special case of which we state
below.

The maximum of the moduli of the eigenvalues of a matrix is called the spectral radius,
and denoted $\rho(A)$. Since this is an important notion, we give a formal definition:

Definition 2.1.4 (Spectral Radius) Let A be an $m \times m$ matrix with eigenvalues $\{\lambda_i\}_{i=1}^{m}$,
where the $\lambda_i$ need not all be distinct. The spectral radius of A is defined by

$\rho(A) = \max_{1 \leq i \leq m} |\lambda_i|$.

The spectral radius is the radius of the smallest circle, centered at the origin, which contains
all of the eigenvalues. We can now state a version of the Perron-Frobenius theorem.

Theorem 2.1.3 (Perron-Frobenius) Let A > 0 be a positive $m \times m$ matrix, and
assume that the eigenvectors of A have norm one. Then the following are among the
properties of A:

1. The spectral radius of A is equal to $\rho$, where $\rho$ is a real eigenvalue of A with algebraic
multiplicity one, and $\rho$ is the unique eigenvalue of modulus $\rho$.

2. The matrix A has a positive eigenvector x corresponding to $\rho$.

3. Denote the sums of the values in the ith row of A by $r_i$, and the ordered row sums,
from smallest to largest, by $r_{(i)}$. Then

$r_{(1)} \leq \rho \leq r_{(m)}$.

We will refer to the positive eigenvalue $\rho$ and the corresponding positive eigenvector x
(of norm one) of the above theorem as the Perron-Frobenius eigenvalue and Perron-Frobenius
eigenvector respectively.

A nonnegative matrix for which all of the row sums equal one is called stochastic. It
follows immediately from the Perron-Frobenius theorem that for a stochastic matrix the
Perron-Frobenius eigenvalue and eigenvector are $\rho = 1$ and

$x = \mathbf{1}/\sqrt{m} = (1, \ldots, 1)^T/\sqrt{m}$  (2.13)

respectively, where $\mathbf{1}$ denotes the m-vector of ones.

We  have  as  another  consequence  of  the  Perron-Frobenius  theorem  the  following  result 
relating  positive  matrices,  Perron-Frobenius  eigenvectors,  and  stochastic  matrices: 

Lemma 2.1.2 Every positive matrix is similar to a matrix proportional to a stochastic
matrix.

Proof: Let A be positive, and let $D_x$ be the diagonal matrix whose diagonal elements
are those of the Perron-Frobenius eigenvector x of A. Then

$B = D_x^{-1} A D_x$

has constant row sums. To see this, let $a_{ij}$ denote the typical element of A, let $r_i$ denote
the sum of the elements in the ith row of B, and let $x_i$ denote the ith diagonal entry in
$D_x$. Then, for any i,

$r_i = \sum_{j=1}^{m} \frac{a_{ij} x_j}{x_i} = \frac{\rho(A) x_i}{x_i} = \rho(A)$;

therefore, $B/\rho(A)$ is stochastic. ∎
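The construction in this proof can be tried numerically. The sketch below is an illustration under stated assumptions: the Perron-Frobenius pair is approximated by plain power iteration (a standard technique, not one prescribed by the text), and the function names and the $2 \times 2$ test matrix are hypothetical.

```python
import math

def perron_pair(A, iterations=200):
    """Approximate the Perron-Frobenius eigenvalue and unit eigenvector of a
    positive square matrix by power iteration."""
    m = len(A)
    x = [1.0 / math.sqrt(m)] * m
    rho = 0.0
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(m)) for i in range(m)]
        rho = math.sqrt(sum(v * v for v in y))  # ||A x|| for a unit vector x
        x = [v / rho for v in y]
    return rho, x

def similar_stochastic(A):
    """Return rho(A) and the matrix D_x^{-1} A D_x / rho(A) of Lemma 2.1.2."""
    rho, x = perron_pair(A)
    m = len(A)
    return rho, [[A[i][j] * x[j] / (x[i] * rho) for j in range(m)]
                 for i in range(m)]

A = [[2.0, 1.0], [4.0, 3.0]]  # row sums 3 and 7
rho, P = similar_stochastic(A)
```

For this matrix the Perron-Frobenius eigenvalue is $(5 + \sqrt{17})/2$, which lies between the extreme row sums, and each row of `P` sums to one up to rounding.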

Knowing that the spectral radius is bounded by the extremal row sums often provides
useful upper, but not lower, bounds on $\rho(A)$. One reason for this is that positive kernels
often decrease to zero at the boundaries of their domain, and the corresponding row sums
of a discretized matrix will be near zero. Actually, the average row sum is still a lower
bound for the spectral radius of a positive symmetric matrix A, as we show below:


Lemma 2.1.3 The Perron-Frobenius eigenvalue of a positive symmetric matrix is bounded
below by the average row sum.

Proof: Let A be positive, symmetric, and $m \times m$ with elements $a_{ij}$, and Perron-Frobenius
eigenvalue and eigenvector denoted by $\rho$ and x, respectively. Since $\rho$ is larger in modulus than
any other eigenvalue of A, we have that (Strang, 1976, p. 253)

$\sup_{y \neq 0} \frac{y^T A y}{y^T y} = \rho$.

Define the unit vector

$z = \mathbf{1}/\sqrt{m}$,

where $\mathbf{1}$ denotes the m-vector of ones. Then

$z^T A z = \sum_{i=1}^{m} \sum_{j=1}^{m} a_{ij}/m = \sum_{i=1}^{m} r_i/m \leq \rho$. ∎

2.2  Matrix  Analysis 

There  are  many  ways  to  define  a  norm  for  square  matrices,  and  corresponding  to  each 
norm  there  is  a  metric  on  the  space  of  square  matrices.  There  is,  therefore,  a  theory  of 
matrix  analysis,  for  which  the  two  volume  work  of  Horn  and  Johnson  (1985,  1990)  is  an 
excellent  reference. 

2.2.1  Matrix  Norms 

A matrix norm satisfies the following five axioms (Horn and Johnson, 1985, p. 290):

Definition 2.2.1 (Matrix Norm) Let $\|\cdot\|$ be a mapping from the space of square matrices,
with elements in C, to $\mathcal{R}$. The function $\|\cdot\|$ is a matrix norm if, for all $m \times m$
matrices A and B,

1. $\|A\| \geq 0$,

2. $\|A\| = 0$ if and only if A = 0,

3. $\|cA\| = |c| \, \|A\|$ for scalars $c \in C$,

4. $\|A + B\| \leq \|A\| + \|B\|$, and

5. $\|AB\| \leq \|A\| \, \|B\|$.

Properties (1-4) are the axioms of a vector norm; a norm with property (5) is called
submultiplicative.

The largest singular value of a matrix provides a matrix norm, the spectral norm (Horn
and Johnson, 1985, p. 295):

Lemma 2.2.1 (Spectral Norm) The largest singular value of a matrix A is a matrix
norm, called the spectral or $l_2$ norm and denoted

$\|A\|_2 = [\rho(A^*A)]^{1/2} = \sigma_1$.

On occasion, we will use another matrix norm, the $l_\infty$ norm, which is easily expressed
in terms of the elements of a matrix (Horn and Johnson, 1985, p. 295):

Lemma 2.2.2 ($l_\infty$ norm) Let A be an $m \times m$ matrix with typical element $a_{ij}$. The
function $\|\cdot\|_\infty$, defined by

$\|A\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^{m} |a_{ij}|$,

is a matrix norm, called the $l_\infty$ norm, or simply the infinity norm.

The  spectral  norm  should  not  be  confused  with  the  spectral  radius.  In  general,  the 
spectral  radius  is  not  a  norm,  but  for  each  fixed  square  matrix  A  it  is  the  greatest  lower 
bound  for  the  values  of  all  matrix  norms  of  A  (Horn  and  Johnson,  1985,  p.  297). 

Theorem 2.2.1 Let a matrix A and $\epsilon > 0$ be given. Then

1. For any matrix norm $\|\cdot\|_\alpha$,

$\rho(A) \leq \|A\|_\alpha$.

2. There exists a matrix norm $\|\cdot\|_\beta$ such that

$\rho(A) \leq \|A\|_\beta \leq \rho(A) + \epsilon$.


2.2.2  Convergent  Matrices 

A square matrix A is said to be convergent (Horn and Johnson, 1985, p. 298) if

$\lim_{k \to \infty} A^k = 0$,  (2.14)

that is, if all of the elements of $A^k$ decrease to zero in absolute value as $k \to \infty$. Another
definition of a convergent matrix, easily shown to be equivalent to (2.14), is

Definition 2.2.2 (Convergent Matrix) An $m \times m$ matrix A is convergent if, for all
vectors $v \in C^m$,

$\lim_{k \to \infty} A^k v = 0$.  (2.15)

A  necessary  and  sufficient  condition  for  a  matrix  to  be  convergent  is  given  by  the  following 
theorem  (Horn  and  Johnson,  1985,  p.  138): 

Theorem 2.2.2 A square matrix A is convergent if and only if $\rho(A) < 1$.

If $\rho(A) = 1$, then the powers of A can converge to a nonzero matrix. A matrix A for
which this is the case is sometimes referred to as semi-convergent. We discuss this idea
in more detail below.

If $\rho(A) = 1$ and A has a Jordan block which is an identity submatrix, then there is a
corresponding subspace U such that for $u \in U$, $A^n u$ does not diverge, although $A^n u \not\to 0$
unless u = 0. If the Jordan form has a block I + N, where $N \neq 0$ is nilpotent, then $A^n u$
blows up for u in a corresponding space.
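Both behaviours are easy to observe on $2 \times 2$ examples. The sketch below is illustrative, with hypothetical helper names and matrices: the first matrix has spectral radius 0.5 but infinity norm 10.5, so its powers decay only after some initial delay; the second is a block of the form I + N with $N \neq 0$, whose powers grow without bound.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(A, k):
    """A^k for an integer k >= 1, by repeated multiplication."""
    P = A
    for _ in range(k - 1):
        P = matmul(P, A)
    return P

def inf_norm(A):
    """Infinity norm: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

A = [[0.5, 10.0], [0.0, 0.5]]  # rho(A) = 0.5 < 1, but the infinity norm is 10.5
J = [[1, 1], [0, 1]]           # I + N with N nonzero, rho(J) = 1

decay = [inf_norm(matrix_power(A, k)) for k in (1, 2, 3, 60)]
growth = [inf_norm(matrix_power(J, n)) for n in (1, 10, 50)]
```

The list `decay` starts at 10.5, 10.25, 7.625 and is below $10^{-12}$ by the 60th power, while `growth` is 2, 11, 51, since the nth power of J has the entry n above its diagonal.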

The following theorem (Horn and Johnson, 1985, p. 299) says something about the
rate at which a convergent matrix approaches zero:

Theorem 2.2.3 Let A be an $m \times m$ matrix, and let $\epsilon > 0$ be given. Then, there exists a
constant $C = C(A, \epsilon)$ such that

$|(A^k)_{ij}| \leq C (\rho(A) + \epsilon)^k$,  (2.16)

for all k = 1, 2, 3, ..., and for all i, j = 1, 2, ..., m.

We will need to sum series of powers of matrices in Chapter 3. The following useful
lemma follows immediately from Theorem 2.2.2:

Lemma 2.2.3 (Geometric Series) If A is a square matrix and $\rho(I - A) < 1$, then A
is nonsingular and

$A^{-1} = \sum_{l=0}^{\infty} (I - A)^l$.  (2.17)

If $\rho(I - A) = 1$, then $(I - A)^l \not\to 0$, so (2.17) cannot be a convergent series. However, the
partial sums of (2.17) may remain bounded, as can be seen from the example

$A = \begin{bmatrix} 1 - i & 0 \\ 0 & 1/2 \end{bmatrix}$.  (2.18)
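The two cases can be compared numerically. This is a minimal sketch with hypothetical names: `neumann_partial_sum` accumulates the first terms of (2.17), the first test matrix is an arbitrary example with $\rho(I - A) = 0.5$, and the second is (2.18) with i read as the imaginary unit, which is an assumption about the notation.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_partial_sum(A, terms):
    """Sum of (I - A)^l for l = 0, ..., terms - 1."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [[identity[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]   # the l = 0 term
    power = [row[:] for row in identity]
    for _ in range(terms - 1):
        power = matmul(power, M)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

# rho(I - A) = 0.5: the partial sums approach the inverse [[1, -1], [0, 2]].
A_conv = [[1.0, 0.5], [0.0, 0.5]]
S_conv = neumann_partial_sum(A_conv, 60)

# Example (2.18): rho(I - A) = 1; the (1,1) entry cycles through 1, 1+i, i, 0.
A_edge = [[1 - 1j, 0.0], [0.0, 0.5]]
bounded = [abs(neumann_partial_sum(A_edge, k)[0][0]) for k in range(1, 41)]
```

In the second case the moduli in `bounded` never exceed $\sqrt{2}$, yet consecutive partial sums keep changing, so the series stays bounded without converging.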


2.2.3  Condition  Numbers 


With respect to any matrix norm, the condition number of a nonsingular matrix A is
defined as

$\kappa(A) = \|A\| \, \|A^{-1}\|$.  (2.19)

If A is singular, then $\kappa(A) = \infty$. Note that, for any matrix norm, and any nonsingular
matrix A,

$\kappa(A) \geq \|A A^{-1}\| = \|I\| \geq \rho(I) = 1$.  (2.20)

A condition number provides a measure of how nearly singular a matrix is, with a large
condition number suggesting that a matrix is ‘nearly’ singular. Let $Kf = g$ be a matrix
equation. If $\kappa$ is a condition number with respect to a norm $\|\cdot\|$, then for any two vectors
$f$ and $\tilde{f}$ (Stoer and Bulirsch, 1980, p. 179),

$\frac{\|f - \tilde{f}\|}{\|f\|} \leq \kappa \frac{\|Kf - K\tilde{f}\|}{\|Kf\|}$,  (2.21)

so a condition number relates the relative change in the right hand side of an equation
to the relative change in a solution. The most often used condition number is defined in
terms of the $l_2$ norm. Let A be a nonsingular $m \times m$ matrix with largest and smallest
singular values given by $\sigma_1$ and $\sigma_m$, respectively. Then

$\kappa_2(A) = \frac{\sigma_1}{\sigma_m}$  (2.22)

is the condition number of A with respect to the $l_2$ norm.
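The role of the condition number in a bound of the form (2.21) can be seen on a nearly singular $2 \times 2$ system. The sketch below is an illustration only: it uses the infinity norm rather than the $l_2$ norm so that everything can be computed by hand, and the matrix K, the vectors, and the helper names are hypothetical.

```python
def inf_norm_matrix(A):
    """Infinity norm of a matrix: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def inf_norm_vector(v):
    """Infinity norm of a vector: maximum absolute entry."""
    return max(abs(t) for t in v)

def inverse_2x2(A):
    """Inverse of a nonsingular 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def condition_number_inf(A):
    """kappa(A) = ||A|| ||A^{-1}|| in the infinity norm."""
    return inf_norm_matrix(A) * inf_norm_matrix(inverse_2x2(A))

def apply(A, v):
    """Matrix-vector product."""
    return [sum(a * t for a, t in zip(row, v)) for row in A]

K = [[1.0, 1.0], [1.0, 1.0001]]  # nearly singular
kappa = condition_number_inf(K)

f, f_tilde = [1.0, 1.0], [2.0, 0.0]
g, g_tilde = apply(K, f), apply(K, f_tilde)
lhs = inf_norm_vector([a - b for a, b in zip(f, f_tilde)]) / inf_norm_vector(f)
rhs = kappa * inf_norm_vector([a - b for a, b in zip(g, g_tilde)]) / inf_norm_vector(g)
```

Here the condition number is about 40004: the two right hand sides differ by roughly 0.005 percent in relative terms, while the two solutions differ by 100 percent, and the relative change in the solution (`lhs`) stays below the bound `rhs`.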


2.3  Elementary  Notions  of  Functional  Analysis 

Since we are ultimately interested in approximating integral equations by matrix equations,
and attempting to solve these resulting matrix equations on a computer, most of
the theoretical discussions in this thesis will be in m-dimensional space which we take,
for flexibility, to be $C^m$ rather than $\mathcal{R}^m$. However, we will make use of some function-space
results concerning integral equations in a Hilbert space, and so we review here the
functional analysis that we will require.


2.3.1  Normed  Vector  Spaces 

A vector space, $\mathcal{H}$, is a set of elements, called vectors, together with the operations of
vector addition and scalar multiplication, over a scalar field. We will take this scalar field
to be either the complex numbers C, or the real numbers $\mathcal{R}$. The defining properties of a
vector space are as follows:

Definition 2.3.1 (Vector Space) Let $\mathcal{H}$ be a nonempty set, let C be a scalar field, and
let there be two binary operations ‘+’ and ‘×’ corresponding to vector addition and scalar
multiplication, respectively. Let x, y, z be arbitrary points in $\mathcal{H}$, and let $\alpha, \beta, \gamma \in C$ be
arbitrary scalars. Then $\mathcal{H}$ is a vector space, and the points in $\mathcal{H}$ are called vectors, if
all of the following properties are satisfied:

1. There is a binary operation, called vector addition, that assigns to each pair of
elements $x, y \in \mathcal{H}$ a unique element of $\mathcal{H}$ called their sum, and denoted x + y. For
all $x, y, z \in \mathcal{H}$:

(a) x + y = y + x,

(b) x + (y + z) = (x + y) + z,

(c) there is an element $0 \in \mathcal{H}$ such that x + 0 = x, and

(d) there is an element $-x \in \mathcal{H}$ such that x + (-x) = 0.

2. There is a rule which assigns to each pair $\alpha \in C$ and $x \in \mathcal{H}$ a unique vector, called the
scalar product of $\alpha$ and x, and denoted $\alpha x$. For arbitrary $\alpha, \beta \in C$ and $x, y \in \mathcal{H}$,
the scalar product has the following properties:

(a) $\alpha(\beta x) = (\alpha\beta)x$,

(b) $\alpha(x + y) = \alpha x + \alpha y$,

(c) $(\alpha + \beta)x = \alpha x + \beta x$, and

(d) $1x = x$.

For a function f, which assigns for each $x \in A$ an element $y = f(x) \in B$, we write
$f : A \to B$. The set A is called the domain of f; the set of all y = f(x) for $x \in A$ is
called the range of f and denoted $\mathcal{R}(f)$. A function is also sometimes called an operator
or a mapping, and we will use these terms interchangeably, although different terms are
customary in different contexts.

Functional  analysis  is  concerned  with  analysis  on  vector  spaces.  In  order  to  do  analysis, 
we  need  a  generalization  of  the  idea  of  the  distance  between  two  vectors  of  an  arbitrary 
vector  space.  This  leads  to  the  concept  of  a  metric: 

Definition 2.3.2 (Metric) Let X be a set, and let $x, y, z \in X$ be arbitrary points. A
metric, $d(x, y) : X \times X \to \mathcal{R}$, is defined by the following properties:

1. d is finite and nonnegative,

2. $d(x, y) = 0 \iff x = y$,

3. $d(x, y) = d(y, x)$, and

4. $d(x, z) \leq d(x, y) + d(y, z)$.

We define next a function mapping vectors into nonnegative scalars called a norm, thereby
generalizing the notion of length to vectors in abstract spaces:

Definition 2.3.3 (Norm) Let $\mathcal{H}$ be a vector space, and let $x, y \in \mathcal{H}$, and $\alpha \in C$ be
arbitrary. A norm, $\|\cdot\| : \mathcal{H} \to \mathcal{R}$, is defined by the following four properties:

1. $\|x\| \geq 0$,

2. $\|x\| = 0 \iff x = 0$,

3. $\|\alpha x\| = |\alpha| \, \|x\|$, and

4. $\|x + y\| \leq \|x\| + \|y\|$.

A metric can always be defined in terms of a norm, for example

$d(x, y) = \|x - y\|$.  (2.23)

A normed space is a vector space together with a norm, $(\mathcal{H}, \|\cdot\|)$. Usually the norm
is understood, and the normed space is denoted simply $\mathcal{H}$. We can do analysis in general
normed spaces; in particular, we can define limits and Cauchy sequences.

Definition 2.3.4 (Limit, Convergence) A sequence $\{x_n\}$ in a normed vector space
$(\mathcal{H}, \|\cdot\|)$ converges to a limit x if $x \in \mathcal{H}$, and for every $\epsilon > 0$ there exists an $N = N(\epsilon)$
such that for all n > N

$\|x - x_n\| < \epsilon$.

Definition 2.3.5 (Cauchy sequence) A sequence $\{x_n\}$ in a normed vector space
$(\mathcal{H}, \|\cdot\|)$ is called a Cauchy sequence if, for every $\epsilon > 0$, there exists an $N = N(\epsilon)$ such that
for all m, n > N

$\|x_m - x_n\| < \epsilon$.

A normed vector space in which all Cauchy sequences converge (to vectors in $\mathcal{H}$) is called
complete. A complete normed vector space is a Banach space.

We  can  discuss  limits  and  continuity  in  Banach  space,  but  we  have  no  notion  of 
orthogonality,  and  so  most  of  the  geometry  of  finite  dimensional  Euclidean  space  does  not 
apply  to  a  general  Banach  space.  However,  if  the  additional  structure  of  an  inner  product 
is imposed on a Banach space, the complete inner product space, or Hilbert space, which
results has a geometry which is in some ways very much like Euclidean space. An inner
product is defined as follows:

Definition 2.3.6 (Inner product) Let $X$ be a vector space, and let $x, y, z \in X$ and
$\alpha, \beta \in \mathcal{C}$ be arbitrary. An inner product is a function $(\cdot,\cdot) : X \times X \to \mathcal{C}$ with the
following properties:

1. $(x + y, z) = (x,z) + (y,z)$,

2. $(\alpha x, y) = \alpha (x,y)$,

3. $(x,y) = \overline{(y,x)}$, and

4. $(x,x) \ge 0$; $(x,x) = 0 \iff x = 0$.

A fundamental inequality for inner products is the Cauchy-Schwarz inequality:

Lemma 2.3.1 (Cauchy-Schwarz Inequality) Let $x$ and $y$ be any vectors in an inner
product space. Then

$|(x,y)|^2 \le (x,x)(y,y)$,  (2.24)

with equality if and only if either $x = 0$, or $y = 0$, or $y = \alpha x$ for some constant $\alpha$.

An  inner  product  determines  a  norm, 

$\|x\| = (x,x)^{1/2}$,  (2.25)

and  the  Cauchy-Schwarz  inequality  relates  this  norm  to  the  corresponding  inner  product. 
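
The inequality (2.24) and the norm (2.25) are easy to check numerically in a finite dimensional space. The following is a minimal sketch, assuming the standard inner product on $\mathcal{C}^3$ (linear in the first argument); the vectors and the helper names `inner` and `norm` are illustrative choices, not part of the text.

```python
import math

def inner(x, y):
    """Inner product on C^m, linear in the first argument: (x, y) = sum x_i * conj(y_i)."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm(x):
    """Norm determined by the inner product, as in (2.25)."""
    return math.sqrt(inner(x, x).real)

# Arbitrary illustrative vectors in C^3.
x = [1 + 2j, -1j, 3 + 0j]
y = [2 - 1j, 4 + 0j, 1 + 1j]

lhs = abs(inner(x, y)) ** 2                  # |(x, y)|^2
rhs = inner(x, x).real * inner(y, y).real    # (x, x)(y, y)

# Equality case: y is a scalar multiple of x.
alpha = 2 - 3j
y_parallel = [alpha * a for a in x]
lhs_eq = abs(inner(x, y_parallel)) ** 2
rhs_eq = inner(x, x).real * inner(y_parallel, y_parallel).real
```

For generic vectors `lhs` is strictly smaller than `rhs`, while for `y_parallel` the two sides agree up to rounding.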

Another important result which holds in a general inner product space is the Pythagorean
Theorem:

Theorem 2.3.1 (Pythagorean Theorem) Let $x$, $x_1$, and $x_2$ be vectors in an inner
product space, where $x = x_1 + x_2$ and $(x_1, x_2) = 0$. Then

$\|x\|^2 = \|x_1\|^2 + \|x_2\|^2$.

Proof:

$\|x\|^2 = (x_1,x_1) + (x_2,x_2) + (x_1,x_2) + (x_2,x_1)$

$= \|x_1\|^2 + \|x_2\|^2$.  $\blacksquare$

2.3.2  Hilbert  Space 

A  Banach  space  in  which  the  norm  is  determined  by  an  inner  product  is  called  a  Hilbert 
space. An example of a Hilbert space with scalar field $\mathcal{R}$ is $\mathcal{R}^m$, with inner product
$(x,y) = x^T y$ and norm $\|x\| = (x^T x)^{1/2}$. Another important example is the space of square
integrable complex valued functions, $L_2$.

Let us define an inner product $(f,g)$ on the vector space $L_2$ of all Lebesgue measurable
complex valued functions for which

$\int_{-\infty}^{\infty} |f(x)|^2 \, dx < \infty$  (2.26)

as follows:

Definition 2.3.7 (Inner Product in $L_2$) Let $f, g \in L_2$. The inner product $(f,g)$ is

$(f,g) = \int_{-\infty}^{\infty} f(x) \overline{g(x)} \, dx$,

where the integral is a Lebesgue integral.

This inner product determines the norm

$\|f\|_2 = (f,f)^{1/2} = \left( \int_{-\infty}^{\infty} |f(x)|^2 \, dx \right)^{1/2}$  (2.27)

(Kreyszig, 1978, p. 62). Strictly speaking, by a function $f$ we mean an equivalence class
of functions which are equal almost everywhere. It turns out that $L_2$ is complete with
respect to this norm, and hence is a Hilbert space.

We will refer to this space as $L_2$, and to the norm $\|\cdot\|_2$ as the $L_2$ norm. We will
be concerned primarily with functions of a real variable, and we will sometimes refer to
this real function space as $L_2$. There are corresponding spaces for functions with other
domains which we will also refer to as $L_2$. Sometimes the domain of the functions in the
space is included in the notation, for example $L_2[0,1]$ is the space of square integrable
functions of a single variable on the unit interval. We will use the notation which doesn't
indicate the domain where there is no risk of confusion.
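
For real functions on $[0,1]$ the inner product of Definition 2.3.7 can be approximated by a quadrature rule. The sketch below assumes a simple midpoint rule and uses the sine system $\sqrt{2}\sin(k\pi x)$ as an illustrative orthonormal family; the function names `l2_inner`, `l2_norm`, and `e` are hypothetical helpers introduced only for this example.

```python
import math

def l2_inner(f, g, a=0.0, b=1.0, n=2000):
    """Midpoint-rule approximation to the L2[a, b] inner product of real functions f and g."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) * g(a + (i + 0.5) * h) for i in range(n))

def l2_norm(f, **kw):
    """Approximate L2 norm, the square root of (f, f)."""
    return math.sqrt(l2_inner(f, f, **kw))

def e(k):
    """k-th element of the sine system sqrt(2) sin(k pi x) on [0, 1]."""
    return lambda x: math.sqrt(2.0) * math.sin(k * math.pi * x)

# Gram matrix of the first three sine functions; it should be close to the identity.
gram = [[l2_inner(e(i), e(j)) for j in (1, 2, 3)] for i in (1, 2, 3)]
```

Because the functions here are real, the complex conjugate in the definition plays no role; a complex-valued version would conjugate the second factor.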

By analogy with the inner product on $\mathcal{R}^m$, we say that two vectors $x$ and $y$ are
orthogonal if $(x,y) = 0$. If $e$ is a unit vector and $x$ is any vector, then we call $(x,e)$ the
orthogonal projection, or simply the projection, of $x$ onto $e$. A sequence of unit vectors
$\{e_i\}$ is called an orthonormal sequence if

$(e_i, e_j) = \delta_{ij}$,  (2.28)

where $\delta_{ij}$ is the Kronecker $\delta$,

$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j. \end{cases}$  (2.29)


An orthonormal sequence is called complete if any vector in the space can be expressed
as a limit of linear combinations of elements in this sequence. A complete orthonormal
sequence is also called an orthonormal basis, or simply a basis. In particular, $L_2$ has
such bases. A Hilbert space which, like $L_2$, has a countable orthonormal basis is said
to be separable. Although we will state some results more generally, we will confine our
attention primarily to $L_2$.

‘Complete’  thus  has  two  meanings.  A  normed  vector  space  is  complete  if  all  Cauchy 
sequences  converge;  an  orthonormal  system  in  a  Hilbert  space  is  complete  if  all  vectors  in 
the  space  can  be  expressed  as  limits  of  linear  combinations  of  vectors  in  the  orthonormal 
system.  It  will  always  be  clear  from  the  context  which  notion  of  completeness  is  to  be 
used. 

For $\mathcal{R}^m$ there is a natural notion of dimension. We can now provide a general definition
for  Hilbert  space.  If  the  number  of  vectors  in  a  basis  is  finite,  then  this  number  is  the  same 
for  any  basis,  and  is  called  the  dimension  of  the  space.  If  a  space  has  an  orthonormal  basis 
consisting  of  infinitely  many  vectors,  then  all  bases  consist  of  infinitely  many  vectors,  and 
we  say  that  the  space  is  infinite  dimensional. 

For any $f \in \mathcal{H}$, we represent $f$ formally in terms of a basis $\{e_i\}_{i=1}^{\infty}$ as a Fourier series

$f = \sum_{i=1}^{\infty} a_i e_i$,  (2.30)

where, for every $i$, $a_i = (f, e_i)$. The $a_i$ are called Fourier coefficients of $f$ with respect to
the basis $\{e_i\}$. We will always interpret an infinite sum such as (2.30) to mean that

$\lim_{N \to \infty} \left\| f - \sum_{i=1}^{N} a_i e_i \right\|^2 = 0$.  (2.31)

Let $\mathcal{H}$ be a Hilbert space. Under what conditions does every vector $f \in \mathcal{H}$ have a
Fourier series representation (2.30)? Bessel's inequality provides a first step toward an
answer to this question:


Lemma 2.3.2 (Bessel Inequality) Let $\{e_i\}$ be any orthonormal sequence in a Hilbert
space $\mathcal{H}$. Let $x \in \mathcal{H}$ be arbitrary. Then

$\sum_i |(x, e_i)|^2 \le \|x\|^2$.  (2.32)

If $\mathcal{H}$ is separable, then we can say more:

Lemma 2.3.3 (Parseval Identity) Let $\{e_i\}$ be an orthonormal basis in a separable
Hilbert space $\mathcal{H}$. Let $x \in \mathcal{H}$ be arbitrary. Then

$\sum_i |(x, e_i)|^2 = \|x\|^2$.  (2.33)

If  we  are  working  in  a  separable  Hilbert  space,  then  we  can  use  Parseval’s  identity  to  show 
that (2.31) holds for any vector $f$. Therefore, if we interpret convergence in the sense of
convergence in norm, $f$ has, for a given basis, a Fourier series (2.30). It can be shown
that  this  series  is  unique.  A  good  discussion,  in  the  context  of  integral  equations,  of  the 
material  of  this  paragraph  is  in  Tricomi  (1957,  pp.  83-88). 
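
The relationship between Fourier coefficients, Bessel's inequality, and convergence in norm can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it works in real $L_2[0,1]$, takes the sine system $\sqrt{2}\sin(k\pi x)$ as the basis, uses the test function $f(x) = x(1-x)$ (for which $\|f\|^2 = 1/30$), and approximates integrals with a midpoint rule. None of these choices come from the text.

```python
import math

def inner(f, g, n=2000):
    """Midpoint-rule approximation to the real L2[0, 1] inner product."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) * g((i + 0.5) * h) for i in range(n))

def e(k):
    """Orthonormal sine system on [0, 1] (assumed basis for this illustration)."""
    return lambda x: math.sqrt(2.0) * math.sin(k * math.pi * x)

f = lambda x: x * (1.0 - x)
norm_sq = inner(f, f)                               # ||f||^2, equal to 1/30 analytically

coeffs = [inner(f, e(k)) for k in range(1, 8)]      # Fourier coefficients a_k = (f, e_k)

partial_sums = []                                   # running sums of a_k^2 (Bessel)
total = 0.0
for a in coeffs:
    total += a * a
    partial_sums.append(total)

def residual_sq(N):
    """|| f - sum_{k<=N} a_k e_k ||^2 for N <= 7, the quantity whose limit appears in (2.31)."""
    basis = [e(k) for k in range(1, N + 1)]
    r = lambda x: f(x) - sum(a * ek(x) for a, ek in zip(coeffs, basis))
    return inner(r, r)
```

The partial sums never exceed `norm_sq` (Bessel), they approach it as more terms are included (Parseval), and `residual_sq(N)` shrinks correspondingly.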

2.3.3  Linear  Operators 

We will be concerned with linear operators, so we give a formal definition of a linear
operator:

Definition 2.3.8 (Linear Operator) Let $\mathcal{U}_1$ and $\mathcal{U}_2$ be vector spaces. A linear operator
$K : \mathcal{U}_1 \to \mathcal{U}_2$ is an operator (i.e., a mapping) such that for any $x$ and $y$ in $\mathcal{U}_1$, and
for any scalars $\alpha, \beta \in \mathcal{C}$,

$K(\alpha x + \beta y) = \alpha K(x) + \beta K(y)$.

We will be interested exclusively in the case where $\mathcal{U}_1$ and $\mathcal{U}_2$ are Hilbert spaces. We will
adopt the conventional notation $Kx$ for $K(x)$. We collect here some definitions for classes
of linear operators which will be needed in this and subsequent chapters:

Definition 2.3.9 (Bounded Operator) A linear operator $K : \mathcal{U}_1 \to \mathcal{U}_2$ between Hilbert
spaces is bounded if there exists a real number $c$ such that, for all $x \in \mathcal{U}_1$,

$\|Kx\| \le c \|x\|$.  (2.34)

Definition 2.3.10 (Continuous Operator) Let $K : \mathcal{U}_1 \to \mathcal{U}_2$ be a linear operator between
Hilbert spaces. $K$ is said to be continuous if for any $\epsilon > 0$, there exists a $\delta > 0$
such that for any vectors $x_1$ and $x_2$ in $\mathcal{U}_1$,

$\|x_1 - x_2\| < \delta \Rightarrow \|Kx_1 - Kx_2\| < \epsilon$.

It  can  be  shown  (Kreyszig,  1978,  p.  97)  that  a  linear  operator  is  continuous  if  and  only 
if  it  is  bounded.  It  is  customary  to  talk  in  this  context  about  bounded,  not  continuous, 
operators. 

Let $\mathcal{H}$ be a Hilbert space, and let $y_0 \in \mathcal{H}$ be arbitrary. The special linear operator
$K : \mathcal{H} \to \mathcal{C}$ defined by $Kx = (x, y_0)$ is bounded. Also, linear operators on $\mathcal{R}^m$ can be
represented by matrices, and are necessarily bounded.

Definition 2.3.11 (Operator Norm) Let $K : \mathcal{U}_1 \to \mathcal{U}_2$ be a bounded linear operator
between Hilbert spaces. The norm of the operator $K$, $\|K\|$, is

$\|K\| = \sup_{x \in \mathcal{U}_1,\, x \ne 0} \frac{\|Kx\|}{\|x\|}$.


Definition 2.3.12 (Adjoint Operator) Let $K : \mathcal{U}_1 \to \mathcal{U}_2$ be a bounded linear operator
between Hilbert spaces. The adjoint of $K$ is the operator $K^* : \mathcal{U}_2 \to \mathcal{U}_1$ such that, for all
$x \in \mathcal{U}_1$ and $y \in \mathcal{U}_2$,

$(Kx, y) = (x, K^*y)$.

We take for granted here that this definition makes sense: that is, that the adjoint exists.
A proof that the adjoint $K^*$ of a bounded linear operator $K$ exists, is bounded and unique,
and that $\|K\| = \|K^*\|$ can be found in Kreyszig (1978, pp. 196-197).
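
In finite dimensions these definitions can be checked directly. The following sketch assumes NumPy and a randomly generated complex matrix standing in for $K$ (an illustrative choice, not from the text); the adjoint is then the conjugate transpose, and the operator norm is the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))   # maps C^4 -> C^3
K_adj = K.conj().T                                                     # adjoint: C^3 -> C^4

def inner(u, v):
    """(u, v) = sum u_i conj(v_i); np.vdot conjugates its first argument."""
    return np.vdot(v, u)

x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y = rng.standard_normal(3) + 1j * rng.standard_normal(3)

lhs = inner(K @ x, y)          # (Kx, y)
rhs = inner(x, K_adj @ y)      # (x, K* y)

op_norm = np.linalg.norm(K, 2)             # sup ||Kx|| / ||x||, the largest singular value
op_norm_adj = np.linalg.norm(K_adj, 2)     # norm of the adjoint
ratio = np.linalg.norm(K @ x) / np.linalg.norm(x)   # one sample of ||Kx|| / ||x||
```

The two inner products agree, the two operator norms agree, and the sampled ratio cannot exceed `op_norm`.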

Definition 2.3.13 (Self-Adjoint Operator) A bounded linear operator $K$ is said to be
self-adjoint if $\mathcal{U}_1 = \mathcal{U}_2$ and $K = K^*$.

Definition 2.3.14 (Positive Operator) Let $K : \mathcal{U} \to \mathcal{U}$ be a self-adjoint linear operator
on a Hilbert space. $K$ is said to be positive if, for all $x \in \mathcal{U}$,

$(Kx, x) \ge 0$.


For $\mathcal{U} = \mathcal{R}^m$, with scalar field $\mathcal{R}$, positive operators correspond to positive semi-definite
matrices.

Eigenvalues  and  eigenvectors  can  also  be  defined  for  general  linear  operators: 

Definition 2.3.15 (Eigenvalue and Eigenvector) Let $K : \mathcal{U} \to \mathcal{U}$ be a linear operator.
The vector $x \in \mathcal{U}$ is an eigenvector, and the scalar $\lambda \in \mathcal{C}$ is an eigenvalue, if

$Kx = \lambda x$.

A positive, self-adjoint linear operator has only real, nonnegative eigenvalues (Kreyszig,
1978, p. 475, problem 5). In matrix analysis, the adjoint corresponds to the transposed
complex conjugate (or transpose, for real matrices), and self-adjoint operators correspond
to Hermitian (or symmetric, in the real case) matrices.


2.3.4  Orthogonal  Complements  in  Hilbert  Space 

We  review  in  this  subsection  some  basic  ideas  about  the  geometry  of  Hilbert  space  which 
we  will  make  extensive  use  of  in  Chapter  3.  A  detailed  exposition  of  this  material  appears 
in  Kreyszig  (1978,  Chapter  3). 

We begin with some elementary notions. Let $\mathcal{H}$ be a Hilbert space, and let $\mathcal{H}_1 \subset \mathcal{H}$
be an arbitrary subset of $\mathcal{H}$. We say that $\mathcal{H}_1$ is a subspace of $\mathcal{H}$ if it is a vector space. If
$\mathcal{H}_1$ contains all of its limit points (with respect to the norm induced by the inner product
on $\mathcal{H}$), then $\mathcal{H}_1$ is said to be a closed subspace of $\mathcal{H}$. A subspace of a complete metric
space is itself complete if and only if it is closed (Kreyszig, 1978, p. 30), and hence $\mathcal{H}_1$ is
a Hilbert space with respect to the inner product on $\mathcal{H}$ if and only if $\mathcal{H}_1$ is closed.

Let $\mathcal{H}$ be a Hilbert space, and let $\mathcal{H}_1 \subset \mathcal{H}$ and $\mathcal{H}_2 \subseteq \mathcal{H}$ be arbitrary subspaces.
The subspaces $\mathcal{H}_1$ and $\mathcal{H}_2$ are said to be orthogonal if, for any $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$,
$(h_1, h_2) = (h_2, h_1) = 0$. We write this as $\mathcal{H}_1 \perp \mathcal{H}_2$.

The set of all vectors orthogonal to a subspace $\mathcal{H}_1$,

$\mathcal{H}_1^{\perp} = \{h \in \mathcal{H} \mid h \perp \mathcal{H}_1\}$,  (2.35)

is a subspace called the orthogonal complement of $\mathcal{H}_1$. The orthogonal complement $\mathcal{H}_1^{\perp}$ is
closed, and if $\mathcal{H}_1$ is closed, then $\mathcal{H}_1^{\perp\perp} = \mathcal{H}_1$ (Kreyszig, 1978, p. 149). In general, we have
that $\mathcal{H}_1^{\perp\perp} = \overline{\mathcal{H}_1}$, where we denote the closure of a space by an overbar.

If every $h \in \mathcal{H}$ can be expressed uniquely as $h = h_1 + h_2$, where $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$,
then $\mathcal{H}$ is equal to the direct sum of the subspaces $\mathcal{H}_1$ and $\mathcal{H}_2$, and we write

$\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$.  (2.36)

Assume now that $\mathcal{H}_1$ is an arbitrary closed subspace, and that $\mathcal{H}_2 = \mathcal{H}_1^{\perp}$. Then $\mathcal{H}$ can
be written as the direct sum (2.36) (Kreyszig, 1978, p. 146). This result is referred to as
the projection theorem. If $h = h_1 + h_2$, where $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$, we say that $h_1$ is the
orthogonal projection (or briefly, the projection) of $h$ onto the closed subspace $\mathcal{H}_1$.

Given a linear operator between Hilbert spaces, $K : \mathcal{U}_1 \to \mathcal{U}_2$, define the nullspace of
$K$ by

$\mathcal{N}(K) = \{x \in \mathcal{U}_1 \mid Kx = 0\}$.  (2.37)

It is easy to show that $\mathcal{N}(K)$ is a closed subspace. The range of $K$ is

$\mathcal{R}(K) = \{y \in \mathcal{U}_2 \mid y = Kx \text{ for some } x \in \mathcal{U}_1\}$.  (2.38)

The operator $K$ is said to be of infinite rank if the dimension of $\mathcal{R}(K)$ is infinite, otherwise
$K$ is said to be of finite rank.

The nullspace and range for the adjoint operator, $K^* : \mathcal{U}_2 \to \mathcal{U}_1$, are

$\mathcal{N}(K^*) = \{x \in \mathcal{U}_2 \mid K^*x = 0\}$,  (2.39)

and

$\mathcal{R}(K^*) = \{y \in \mathcal{U}_1 \mid y = K^*x \text{ for some } x \in \mathcal{U}_2\}$.  (2.40)

If $\mathcal{U}_2$ is infinite dimensional, then $\mathcal{R}(K)$ need not be closed (and similarly for $\mathcal{U}_1$ and
$\mathcal{R}(K^*)$).

It is not difficult to establish (e.g., Kress, 1989, p. 226) that

$\mathcal{R}(K)^{\perp} = \mathcal{N}(K^*)$,  (2.41)

$\mathcal{R}(K^*)^{\perp} = \mathcal{N}(K)$,  (2.42)

$\mathcal{U}_2 = \overline{\mathcal{R}(K)} \oplus \mathcal{N}(K^*)$,  (2.43)

$\mathcal{U}_1 = \overline{\mathcal{R}(K^*)} \oplus \mathcal{N}(K)$.  (2.44)

If $K$ is self-adjoint, then $\mathcal{U}_1 = \mathcal{U}_2$, $K = K^*$, and $\mathcal{R}(K)^{\perp} = \mathcal{N}(K)$.

2.3.5 Compact Linear Operators

Compact  operators  on  an  infinite  dimensional  Hilbert  space  have  a  structure  that  is  in 
many  ways  similar  to  that  of  matrices  in  a  finite  dimensional  space.  The  prototypical 
compact  operators  are  integral  operators.  We  begin  with  a  definition: 

Definition 2.3.16 (Compact Operator) Let $K : \mathcal{U}_1 \to \mathcal{U}_2$ be a linear operator between
separable Hilbert spaces. $K$ is said to be compact if for every bounded sequence $\{x_i\}_{i=1}^{\infty}$
in $\mathcal{U}_1$ the sequence $\{Kx_i\}_{i=1}^{\infty}$ has a convergent subsequence.

A compact operator is necessarily bounded, since otherwise there would exist a bounded
sequence $\{x_i\}_{i=1}^{\infty}$ such that $\|Kx_i\| \to \infty$, and for which $\{Kx_i\}_{i=1}^{\infty}$ has no convergent
subsequence. Since an operator is bounded if and only if it is continuous, it follows that
a compact operator must be continuous.

The  most  important  properties  of  compact  operators  for  our  purposes  are  the  spectral 
properties;  and  with  respect  to  spectral  properties,  compact  operators  behave  very  much 
like  matrices.  We  state,  without  proof,  the  spectral  theorem  for  compact,  self-adjoint 
operators 

Theorem 2.3.2 (Spectral Theorem for Compact Self-Adjoint Operators) Let
$K : \mathcal{U} \to \mathcal{U}$ be a compact, self-adjoint operator on a Hilbert space. There exists a sequence of
vectors $\{\phi_i\}$, such that

$(\phi_i, \phi_j) = \delta_{ij}$;

and a bounded sequence of nonzero real scalars $\{\lambda_i\}$, such that, for all $i$,

$K\phi_i = \lambda_i \phi_i$.

Each eigenvalue can correspond to at most a finite number of $\phi_i$. Thus we can, without
loss of generality, label the $\lambda_i$ in nonincreasing order of absolute value, so that

$|\lambda_1| \ge |\lambda_2| \ge \cdots$,

are the nonzero eigenvalues of the operator $K$, and the corresponding orthonormal vectors
$\{\phi_i\}$ are eigenvectors. If there are infinitely many distinct and nonzero eigenvalues, then
these eigenvalues must have zero as an accumulation point.

For any $x \in \mathcal{U}$, we have that

$Kx = {\sum_i}' \lambda_i (x, \phi_i) \phi_i$,

where by $\sum'$ we mean the sum over all (finite or infinitely many) nonzero eigenvalues.

Zero may also be an eigenvalue of a compact operator $K$, and, if so, we denote this
eigenvalue of special importance by $\lambda_0$. A good source for the spectral theory of compact
operators is Kreyszig (1978, Chapter 8).

If  a  linear  operator  is  compact  but  not  self-adjoint,  then  the  eigenvalues  need  not  be 
real  or  even  exist.  For  any  compact  operator,  the  subspace  spanned  by  the  eigenvectors 
corresponding  to  a  single  eigenvalue  can  have  dimension  greater  than  one,  but  must  be 
finite  dimensional.  When  discussing  compact  operators  which  are  not  self-adjoint,  we  will 
make  use  of  the  singular  vector  expansion ,  which  is  a  natural  extension  of  the  singular 
value decomposition to compact operators in a Hilbert space (Smithies, 1958, Chapter
8). If $K$ is compact, then the operators $K^*K$ and $KK^*$ are compact, self-adjoint, and
positive, with the same nonzero eigenvalues. It turns out that these eigenvalues are the squares
of  singular  values  of  K,  defined  in  a  way  exactly  analogous  to  the  singular  values  of  a 
matrix. 

Theorem 2.3.3 (Singular Vector Expansion) Let $K : \mathcal{U}_1 \to \mathcal{U}_2$ be a compact linear
operator between Hilbert spaces. The operators $K^*K$ and $KK^*$ are compact, self-adjoint
and positive, each with nonzero eigenvalues

$\sigma_1^2 \ge \sigma_2^2 \ge \cdots > 0$,

where

$K\psi_i = \sigma_i \phi_i$,

$K^*\phi_i = \sigma_i \psi_i$,

and hence

$KK^*\phi_i = \sigma_i^2 \phi_i$,

$K^*K\psi_i = \sigma_i^2 \psi_i$.

For all $i$ and $j$,

$(\phi_i, \phi_j) = \delta_{ij}$,

and

$(\psi_i, \psi_j) = \delta_{ij}$.

For any $x \in \mathcal{U}_1$, we have

$Kx = \sum_i \sigma_i (x, \psi_i) \phi_i$,

where, as in Theorem 2.3.2, we interpret this sum to be over the (finite or infinitely
many) singular values.

The positive constants $\{\sigma_i\}$ are called the singular values of $K$, and the two orthonormal
sequences $\{\phi_i\}$ and $\{\psi_i\}$ are called singular vectors. We will, on occasion, find it convenient
to refer to $\{\phi_i, \psi_i; \sigma_i\}$ as a singular system. It is customary to define the singular values of
infinite rank operators to be positive, in contrast to the singular values of matrices, which
are nonnegative, and which can be zero.
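
For a matrix, the singular system of Theorem 2.3.3 is exactly the singular value decomposition, and the relations of the theorem can be verified numerically. The sketch below assumes NumPy and a random real $5 \times 3$ matrix as a stand-in for $K$; the variable names `Phi`, `Psi`, and `sigma` are chosen here to mirror $\phi_i$, $\psi_i$, and $\sigma_i$ and are not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
K = rng.standard_normal((5, 3))        # real matrix standing in for K : R^3 -> R^5

# Thin SVD: K = Phi diag(sigma) Psi^T, singular values in nonincreasing order.
Phi, sigma, PsiT = np.linalg.svd(K, full_matrices=False)
Psi = PsiT.T                           # columns psi_i lie in R^3; columns of Phi (phi_i) in R^5

# Singular vector expansion K x = sum_i sigma_i (x, psi_i) phi_i for an arbitrary x.
x = rng.standard_normal(3)
expansion = sum(s * (x @ Psi[:, i]) * Phi[:, i] for i, s in enumerate(sigma))

# Eigenvalues of K^T K, sorted in descending order, to compare with sigma^2.
eig_KtK = np.sort(np.linalg.eigvalsh(K.T @ K))[::-1]
```

A random matrix of this shape has full rank with probability one, so all three singular values are positive here; for a rank-deficient matrix some of them would be zero, which is the contrast with the infinite rank convention noted above.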

2.4  Fredholm Integral Equations of the First Kind in $L_2$

Let the function $k(x,y) \in L_2\{[0,1] \times [0,1]\}$ be the kernel of a Fredholm integral equation
of the first kind:

$\int_0^1 k(x,y) f(y) \, dy = g(x)$.  (2.45)

If, in (2.45), $f \in L_2$, then it can be shown that $g \in L_2$. The linear operator equation
corresponding to (2.45) can be written as $Kf = g$, where $K : L_2[0,1] \to L_2[0,1]$, given by

$(Kf)(x) = \int_0^1 k(x,y) f(y) \, dy$,  (2.46)

is compact (Young, 1988, p. 93). If $k(x,y) = \overline{k(y,x)}$, then we say that the kernel is
self-adjoint; if $k(x,y) = k(y,x)$ then we say $k(x,y)$ is symmetric. It is easy to see that
linear operators $K$ corresponding to self-adjoint (in particular, real symmetric) kernels are
self-adjoint. We will consider next the equation $Kf = g$ where $k(x,y)$ is not necessarily
self-adjoint. The self-adjoint case will not be discussed separately; it follows easily, by means of
Theorem 2.3.2, from the more general results of this subsection.

Since $K$ is compact, from Theorem 2.3.3 there exists a set of singular values of $K$,
$\{\sigma_i\}$, an orthonormal basis $\{\phi_i\}$ for $\overline{\mathcal{R}(K)}$, and an orthonormal basis $\{\psi_i\}$ for $\overline{\mathcal{R}(K^*)}$. It
follows easily from this that

$k(x,y) = \sum_i \sigma_i \phi_i(x) \overline{\psi_i(y)}$.  (2.47)

We will be primarily concerned with the case where $k(x,y)$ is real, in which case $\{\phi_i\}$ and
$\{\psi_i\}$ can be taken to be real as well. If there are infinitely many terms in the sum (2.47),
then by the equal sign we mean that

$\lim_{N \to \infty} \left\| k(x,y) - \sum_{i=1}^{N} \sigma_i \phi_i(x) \overline{\psi_i(y)} \right\| = 0$.  (2.48)

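Let $\{\tilde\phi_j\}$ be an orthonormal basis for $\mathcal{R}(K)^{\perp} = \mathcal{N}(K^*)$; and let $\{\tilde\psi_j\}$ be an orthonormal
basis for $\mathcal{R}(K^*)^{\perp} = \mathcal{N}(K)$. Then $f$ and $g$ can be written as

$f = \sum_i a_i \psi_i + \sum_j \tilde a_j \tilde\psi_j$  (2.49)

and

$g = \sum_i b_i \phi_i + \sum_j \tilde b_j \tilde\phi_j$.  (2.50)

The Fourier coefficients $a_i$, $\tilde a_j$, $b_i$, and $\tilde b_j$ are easily shown to be projections of $f$ or $g$ onto
basis functions; for example $a_i = (f, \psi_i)$.

Since

$Kf = \sum_i \sigma_i (f, \psi_i) \phi_i = \sum_i \sigma_i a_i \phi_i$,  (2.51)

in order for $Kf = g$ to have a solution, we must have that, for all $i$ and $j$, $\tilde b_j = 0$ and $a_i \sigma_i = b_i$.
Any solution must have the form

$f = \sum_i a_i \psi_i + \sum_j \tilde a_j \tilde\psi_j = \sum_i \frac{b_i}{\sigma_i} \psi_i + h$  (2.52)

for arbitrary $h \in \mathcal{N}(K)$. For $f$ to be a solution, we must have that

$\sum_i \frac{b_i}{\sigma_i} \psi_i \in L_2$.  (2.53)

It can be shown that a necessary and sufficient condition for (2.53) is that

$\sum_i \frac{|b_i|^2}{\sigma_i^2} < \infty$.  (2.54)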

Thus we have the following theorem, proved by Picard (1910) for linear Fredholm
integral equations of the first kind, and later extended by others (e.g., Groetsch, 1980,
pp. 156-157) to arbitrary compact linear operators.

Theorem 2.4.1 (Picard) Let $K$ be compact, with singular system $\{\phi_i, \psi_i; \sigma_i\}$, and let
$g \in L_2$ be given. There exists a function $f$ such that $Kf = g$ if and only if

1. $\sum_{i=1}^{\infty} \frac{|(g, \phi_i)|^2}{\sigma_i^2} < \infty$, and

2. $(g, u) = 0$ for all $u$ such that $K^*u = 0$.
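
The first condition of Picard's theorem says that the coefficients of $g$ must decay faster than the singular values. The following sketch illustrates this with purely hypothetical sequences: singular values $\sigma_i = i^{-2}$, and two candidate coefficient sequences, $b_i = i^{-3}$ (which satisfies the condition) and $b_i = i^{-2}$ (which does not). These sequences are assumptions made for illustration and do not correspond to any particular kernel.

```python
def picard_partial_sums(b, sigma):
    """Partial sums of sum_i |b_i|^2 / sigma_i^2 for coefficient and singular-value lists."""
    total, sums = 0.0, []
    for bi, si in zip(b, sigma):
        total += abs(bi) ** 2 / si ** 2
        sums.append(total)
    return sums

N = 2000
sigma = [1.0 / i ** 2 for i in range(1, N + 1)]     # hypothetical singular values
b_fast = [1.0 / i ** 3 for i in range(1, N + 1)]    # decays faster than sigma
b_slow = [1.0 / i ** 2 for i in range(1, N + 1)]    # decays only as fast as sigma

sums_fast = picard_partial_sums(b_fast, sigma)      # terms 1/i^2: partial sums stay bounded
sums_slow = picard_partial_sums(b_slow, sigma)      # terms equal to 1: partial sums grow like N
```

In the first case the partial sums remain below $\pi^2/6$, so the series converges; in the second they grow without bound as more terms are included, so no square integrable solution exists.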

2.4.1  Existence and Uniqueness of Solutions of Linear Operator Equations

We  summarize  next  the  conditions  under  which  a  solution  to  a  linear  operator  equation 
of  the  first  kind  exists,  and  the  conditions  under  which  it  is  unique. 

A solution to $Kf = g$ exists if and only if $g \in \mathcal{R}(K)$. If a solution $f$ exists, then it is
unique if and only if $\mathcal{N}(K) = \{0\}$. If more than one solution exists, then the difference
between any two solutions is in $\mathcal{N}(K)$, and therefore, as a consequence of the projection
theorem, there exists exactly one solution $f_1 \in \overline{\mathcal{R}(K^*)} = \mathcal{N}(K)^{\perp}$. Let $g \in \mathcal{R}(K)$, and let
$f_1$ be the unique $f_1 \perp \mathcal{N}(K)$ such that $Kf_1 = g$. The set of all solutions to $Kf = g$ is
given by

$\mathcal{F} = \{ f = f_1 + f_2 \mid Kf_1 = g,\ f_1 \in \mathcal{N}(K)^{\perp},\ f_2 \in \mathcal{N}(K) \}$.  (2.55)

By the Pythagorean theorem,

$\|f\|^2 = \|f_1\|^2 + \|f_2\|^2$.  (2.56)

Since for any solution $f$, $\|f\| \ge \|f_1\|$, $f_1 \in \mathcal{F}$ is the minimum norm solution.
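
A small singular matrix equation shows the structure of the solution set (2.55). The sketch below assumes NumPy and uses its Moore-Penrose pseudoinverse as a convenient way to obtain the solution orthogonal to the nullspace; the text does not prescribe this method, and the $2 \times 2$ matrix is an illustrative choice.

```python
import numpy as np

K = np.array([[1.0, 1.0],
              [1.0, 1.0]])           # singular: N(K) is spanned by (1, -1)
g = np.array([2.0, 2.0])             # g lies in R(K), so solutions exist

f1 = np.linalg.pinv(K) @ g           # pseudoinverse solution, orthogonal to N(K)
null_vec = np.array([1.0, -1.0]) / np.sqrt(2.0)   # unit vector spanning N(K)

f_other = f1 + 3.0 * null_vec        # another solution: f1 plus an element of N(K)

norm_f1_sq = np.linalg.norm(f1) ** 2
norm_other_sq = np.linalg.norm(f_other) ** 2      # equals norm_f1_sq + 9 by (2.56)
```

Both `f1` and `f_other` solve the equation, but `f1` has the smaller norm, in agreement with (2.56).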

2.4.2  Infinite  Rank  Compact  Operator  Equations  of  the  First  Kind  are 
Ill-Posed 

In Chapter 1, we defined what it means for an equation to be ill-posed, and we provided
some  intuition  for  why  integral  equations  of  the  first  kind  are  often  ill-posed.  We  now 
use  the  theory  outlined  in  the  present  chapter  to  build  on  this  intuition  in  a  more  general 
context. 

The Nature of the Spectrum of Infinite Rank Compact Linear Operator Equations

Let $K : \mathcal{H} \to \mathcal{H}$ be a compact, positive, self-adjoint linear operator on a separable Hilbert
space. Then $K$ has a finite or countable spectrum of positive eigenvalues (Theorem
2.3.2). If $K$ is of infinite rank, then $K$ has infinitely many nonzero eigenvalues, and these
eigenvalues must have zero as an accumulation point. In particular, $K = T^*T$ is compact,
positive, and self-adjoint for any compact linear operator $T$, and the nonzero eigenvalues
of $K$ are the squares of the singular values of $T$ (Theorem 2.3.3). Therefore if $K$ is
compact, of infinite rank, but not necessarily self-adjoint, then the singular values of $K$
(eigenvalues, if $K = K^*$) will have zero as an accumulation point. It can be shown that
$K$ cannot have a bounded inverse, and hence that the linear operator equation $Kf = g$
is ill-posed.

Because of this, a necessary condition for this equation to have a solution is that
the Fourier coefficients in the expansion of $g$ must decrease in absolute value sufficiently
rapidly as the corresponding singular values approach zero, a result made precise by
Picard's Theorem (2.4.1).


$\mathcal{R}(K) \ne \overline{\mathcal{R}(K)}$ if $K$ is Compact and of Infinite Rank: Some Implications for
Ill-Posedness

The second part of Theorem 2.4.1 states that $g \perp \mathcal{N}(K^*)$ (or, if $K$ is self-adjoint,
$g \perp \mathcal{N}(K)$). This ensures that $g \in \overline{\mathcal{R}(K)}$. But, as we shall see in this subsection, if $K$
has infinitely many nonzero eigenvalues, then $\mathcal{R}(K)$ is not closed. Therefore, the first
condition in Theorem 2.4.1 is required in order to demonstrate that $g \in \mathcal{R}(K)$. If $g$ is
observed with error and/or represented on a computer, the fact that $\mathcal{R}(K) \ne \overline{\mathcal{R}(K)}$ has
important consequences, as can be seen by the following result of Strand (1974).

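Theorem 2.4.2 (Strand, 1974, p. 801) Let $K$ be compact, with eigenvalues $\{\lambda_i\}_{i=1}^{\infty}$,
where

$|\lambda_1| \ge |\lambda_2| \ge \cdots \ge 0$.  (2.57)

Assume that infinitely many of these eigenvalues are nonzero. Let $g \in \mathcal{R}(K)$ and $\epsilon > 0$
be arbitrary. Then there exists a function $\tilde g \in \mathcal{H}$ such that:

1. $\tilde g \notin \mathcal{R}(K)$,

2. $\tilde g \perp \mathcal{N}(K^*)$, and

3. $\|g - \tilde g\| < \epsilon$.

By definition $\mathcal{R}(K)$ is dense in its closure, $\overline{\mathcal{R}(K)}$. Theorem 2.4.2 states that $\overline{\mathcal{R}(K)} - \mathcal{R}(K)$
is dense in $\overline{\mathcal{R}(K)}$.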

This  result  provides  one  way  of  understanding  what  it  means  for  an  integral  equation 
to  be  ill-posed.  One  can  always  find  a  perturbation  of  the  right  hand  side  of  arbitrarily 
small  norm  which  changes  a  solvable  integral  equation  into  an  equation  with  no  solution. 

Actually, the consequences of ill-posedness for the numerical solution of integral equations
of the first kind are somewhat different. When an equation with a reasonably smooth
kernel is discretized for solution on a computer, the resulting system of algebraic equations
has many small eigenvalues, and hence is very nearly singular. The exact right hand side
$g(x)$ and a representation of $g(x)$ on a computer will always be slightly different, because
of inevitable roundoff and discretization error. The solution of the matrix equation
corresponding to this slightly perturbed right hand side will very likely exist; however, it will
often be very different from the exact solution $f$.
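
The near singularity of a discretized smooth kernel, and the resulting amplification of small errors in the right hand side, can be seen in a short numerical experiment. The sketch below makes several assumptions that are not from the text: a Gaussian kernel $k(x,y) = \exp(-(x-y)^2/0.02)$, a midpoint discretization with 40 points, and a relative threshold of $10^{-10}$ used only to pick a singular value that is small but still well resolved in floating point arithmetic.

```python
import numpy as np

n = 40
x = (np.arange(n) + 0.5) / n                                # midpoint grid on [0, 1]
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.02) / n      # kernel values times weight 1/n

U, s, Vt = np.linalg.svd(K)                                 # singular values in decreasing order
k = int(np.sum(s > 1e-10 * s[0])) - 1       # index of the last singular value above threshold

eps = 1e-6
delta_g = eps * U[:, k]                     # perturbation of the right hand side, norm eps
delta_f = (eps / s[k]) * Vt[k]              # corresponding change in the solution of K f = g
amplification = np.linalg.norm(delta_f) / np.linalg.norm(delta_g)   # equals 1 / s[k]
```

Here a perturbation of norm $10^{-6}$ in the right hand side, chosen along a left singular vector, produces a change in the solution that is larger by the factor $1/\sigma_k$, which is many orders of magnitude.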


2.5  Probability  Theory 

One  empirical  basis  for  mathematical  probability  lies  in  the  observation  of  the  long  range 
relative  frequency  of  ‘favorable’  events  in  the  repetition  of  a  random  experiment.  The 
theory  originated  with  the  investigation  of  games  of  chance  in  the  seventeenth  century, 
where  a  set  of  elementary  outcomes  were  treated  as  equally  likely.  A.  N.  Kolmogorov 
provided  an  axiomatic  foundation  for  probability  in  1933,  making  use  of  the  theory  of 
measure  and  integration.  The  present  section  is  a  very  brief  outline  of  the  principal  ideas 
of  probability  theory,  along  with  the  definitions  and  some  important  properties  of  certain 
probability distributions. There are many introductory books at various levels which the
reader can turn to for details; the present discussion follows Tucker (1967).

2.5.1  Probability Spaces

In  order  to  have  a  rigorous  discussion  of  probability,  it  is  necessary  to  define  a  set  of 
possible  outcomes  of  a  random  phenomenon,  called  a  sample  space. 

Definition 2.5.1 (Sample Space, Elementary Event) A sample space $\Omega$ is a set
of elements or points $\omega \in \Omega$, called elementary events, each of which is a possible
outcome of a random phenomenon under consideration.

Probability is a set function which associates subsets of $\Omega$ with numbers in the unit
interval. If the sample space is uncountable, then it is necessary to restrict this set
function to a class of subsets which satisfies the properties of a $\sigma$-field:

Definition 2.5.2 ($\sigma$-field) A set of subsets $S$ of $\Omega$ is called a $\sigma$-field if

1. for every $A \in S$, $A^c \in S$,

2. if $A_1, A_2, \ldots, A_n, \ldots$ is a countable sequence of elements of $S$, then $\cup_n A_n \in S$, and

3. $\emptyset \in S$.

Subsets $A \in S$ are called events. The pair $(\Omega, S)$ is sometimes called a measurable space.

In  order  for  a  set  function  to  be  a  probability  or  probability  measure ,  this  function  must 
be  as  defined  in  the  following: 

Definition 2.5.3 (Probability) A probability $P$ is a normed measure over a measurable
space $(\Omega, S)$; that is, $P$ is a real-valued function which assigns to every $A \in S$ a
number $P(A)$ such that

1. $P(A) \ge 0$ for every $A \in S$,

2. $P(\Omega) = 1$, and

3. if $\{A_n\}_{n=1}^{\infty}$ is any countable sequence of disjoint events, then

$P(\cup_{n=1}^{\infty} A_n) = \sum_{n=1}^{\infty} P(A_n)$.


A probability space can now be defined.

Definition 2.5.4 (Probability Space) A probability space is a triple $(\Omega, S, P)$, where
$\Omega$ is a sample space, $S$ is a $\sigma$-field of subsets of $\Omega$, and $P$ is a probability measure on the
measurable space $(\Omega, S)$.
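
For a finite sample space these definitions can be verified by enumeration. The sketch below is an illustration only: it assumes the sample space of a single roll of a die, takes the power set as the $\sigma$-field, and uses the uniform probability. For a finite collection, closure under countable unions reduces to closure under pairwise unions, which is what is checked.

```python
from itertools import combinations
from fractions import Fraction

omega = frozenset({1, 2, 3, 4, 5, 6})          # sample space: one roll of a die
# Power set of omega, used here as the sigma-field S.
S = {frozenset(c) for r in range(len(omega) + 1) for c in combinations(omega, r)}

def P(A):
    """Uniform probability: each elementary event has probability 1/6."""
    return Fraction(len(A), len(omega))

# The three sigma-field properties of Definition 2.5.2.
closed_under_complement = all((omega - A) in S for A in S)
closed_under_union = all((A | B) in S for A in S for B in S)
contains_empty = frozenset() in S

# Additivity of P on a pair of disjoint events.
A, B = frozenset({1, 2}), frozenset({5})
additive = P(A | B) == P(A) + P(B)
```

Exact rational arithmetic (`Fraction`) is used so that the additivity check is not affected by rounding.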

2.5.2  Random  Variables  and  Probability  Distributions 

Often one cannot, or does not want to, observe directly $\omega \in \Omega$. Instead, what is measured
or studied is the value of a function on the sample space. Such an $S$-measurable function
is called a random variable.

Definition 2.5.5 (Random Variable) Let $(\Omega, S, P)$ be a probability space. A random
variable, $X : \Omega \to \mathcal{R}$, is a real-valued $S$-measurable function. That is, for every real
number $x$,

$\{\omega \in \Omega \mid X(\omega) \le x\} \in S$.

We will adopt the convention of using the notation $X$ both for the random variable $X$ and
for a value of this random variable $X(\omega)$. We will denote the event $\{\omega \in \Omega \mid X(\omega) \le x\}$ by
$\{X \le x\}$, and its probability by $P(X \le x)$.

Associated with every random variable $X$ is a distribution function (also called a
cumulative distribution function, a cdf, or simply a distribution), $F_X(x)$, which gives the
probability that $X$ is less than or equal to any real number $x$.

Definition 2.5.6 (Distribution Function) If $X$ is a random variable, its distribution
function $F_X$ is defined by

$F_X(x) = P(X \le x)$.

It can be shown that $F_X$ is monotone nondecreasing, right-continuous, and that

$\lim_{x \to -\infty} F_X(x) = 0$,

and

$\lim_{x \to \infty} F_X(x) = 1$.

It is straightforward to extend the definition of distribution to the joint distribution
of several random variables.

Definition 2.5.7 (Multivariate Distribution Function) Let $X_1, \ldots, X_n$ be random
variables, where $n \ge 1$. The joint distribution function of $\{X_1, \ldots, X_n\}$ is defined by

$F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = P(\cap_{i=1}^{n} \{X_i \le x_i\})$,

where $-\infty < x_i < \infty$, for $1 \le i \le n$.

A  related  concept  is  the  probability  density ,  defined  in  the  univariate  case  as  follows: 

Definition 2.5.8 (Probability Density) Let $F_X(x)$ be an absolutely continuous distribution
function. Then

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

for some function $f_X(t)$, called the probability density of the random variable $X$.

A random variable $X$ which has a density

$$f_X(x) = \frac{dF_X(x)}{dx}$$ (2.58)

is said to be continuous. A random variable which takes on values in a finite or countable
set is said to be discrete. The notions of distribution and density can be generalized to
random variables which assume values in more general spaces.

Corresponding to joint distribution functions, there can be joint probability densities.
We will only need to make use of bivariate densities. For example, let $X$ and $Y$ be
two continuous random variables with joint density $f_{X,Y}(x,y)$. The univariate marginal
density of either random variable is obtained by ‘integrating out’ the other variable; for
example,

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy$$ (2.59)

is the marginal density of $X$.
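
As a numerical illustration of (2.59), the following Python sketch integrates out $y$ from a joint density by the trapezoidal rule. The choice of a standard bivariate normal joint density with correlation 0.5, the integration limits, and the function names are assumptions made only for this example; in that case the marginal of $X$ should be the standard normal density.

```python
import math

def joint_density(x, y, rho=0.5):
    """Standard bivariate normal density with correlation rho (illustrative choice)."""
    norm = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))
    quad = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return norm * math.exp(-0.5 * quad)

def marginal_x(x, lo=-10.0, hi=10.0, steps=4000):
    """Approximate f_X(x) = integral of f_{X,Y}(x, y) dy by the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (joint_density(x, lo) + joint_density(x, hi))
    for k in range(1, steps):
        total += joint_density(x, lo + k * h)
    return total * h

def standard_normal_density(x):
    """Density of N(0, 1), the known marginal for this joint density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
```

Comparing `marginal_x(0.3)` with `standard_normal_density(0.3)` gives a simple check that the ‘integrating out’ step has been carried out correctly.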
2.5.3  Expectation  and  Moments 

Mathematical  expectation  is  a  linear  functional  of  a  random  variable  which  models  the 
empirical  fact  of  long  run  averages. 

Definition 2.5.9 (Expectation) The expectation of a random variable $X$, denoted
$E(X)$, is defined to be the Lebesgue integral of $X$ with respect to the probability measure
$P$,

$$E(X) = \int X\,P(d\omega),$$

provided that this integral exists.

Usually it is more convenient to write this integral either as a Lebesgue-Stieltjes integral
with respect to the distribution of a random variable, or else as an integral involving a
probability density. If $X$ has a density $f_X(x)$, then the following are equal:

$$E(X) = \int X\,P(d\omega) = \int_{-\infty}^{\infty} x\,dF_X(x) = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$ (2.60)

The expectations of $X^n$ are of particular importance. When these expectations exist,
they are called moments of the random variable $X$.

Definition 2.5.10 (Moments, Central Moments) The $n$th moment of a random
variable $X$ is defined to be the expectation $E(X^n)$, provided that this expectation exists. If
$E(X) = \mu$, then the $n$th central moment is defined to be $E[(X - \mu)^n]$.

The mean of a random variable $X$ is $E(X)$, and it is usually denoted $\mu$. If $X$ is a random
variable with mean $\mu$, then the variance of $X$, usually denoted $\sigma^2$, is $E[(X - \mu)^2]$.

Expectation is a linear functional; that is, if $X$ and $Y$ are any random variables for
which $E(X)$ and $E(Y)$ exist, and $\alpha$ and $\beta$ are constants, then

$$E(\alpha X + \beta Y) = \alpha E(X) + \beta E(Y).$$ (2.61)
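
The linearity property (2.61) and the definition of a central moment can be checked exactly on a small finite sample space. The sketch below is only an illustration: the two-dice sample space, the particular random variables, and the constants `a` and `b` (playing the roles of the two constants in (2.61)) are hypothetical choices, and exact rational arithmetic is used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice, each with probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
prob = Fraction(1, 36)

def expectation(g):
    """E[g] as a finite sum over the sample space."""
    return sum(g(w) * prob for w in outcomes)

X = lambda w: w[0]            # value of the first die
Y = lambda w: w[0] + w[1]     # total of both dice; deliberately dependent on X
a, b = Fraction(2), Fraction(-3)

# Linearity holds even though X and Y are not independent.
lhs = expectation(lambda w: a * X(w) + b * Y(w))
rhs = a * expectation(X) + b * expectation(Y)

mean_x = expectation(X)
var_x = expectation(lambda w: (X(w) - mean_x) ** 2)  # second central moment
```

Note that linearity requires no independence assumption, which is why `Y` was chosen to depend on `X`.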


2.5.4  Conditional  Probability  and  Independence 

Intuitively,  if  we  toss  a  coin  twice,  the  result  of  the  first  toss  has  ‘no  effect’  on  the  result 
of  the  second  toss.  We  would  say  that  these  tosses  are  ‘independent’.  This  provides  a 
motivation  for  the  concept  of  independence  in  probability  theory. 

Independence  is,  of  course,  a  special  situation.  For  example,  one  might  ask  how  one 
would  estimate  the  probability  of  drawing  the  ace  of  spades  as  a  second  card  given  each 
of  the  three  following  situations: 

1.  that  the  first  card  drawn  is  the  ace  of  spades, 


2.  that  the  first  card  drawn  is  the  eight  of  hearts,  or 

3.  no  information  on  the  first  card. 

This  leads  naturally  to  the  notion  of  conditional  probability. 

Definition 2.5.11 (Conditional Probability) If $(\Omega, S, P)$ is a probability space, and
$A, B \in S$, with $P(A) > 0$, then

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

is called the conditional probability of $B$ given $A$.

It is easy to show that $P(\cdot \mid A)$ is a probability measure.
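
The three card-drawing situations listed above can be worked out directly from Definition 2.5.11 by enumerating a finite sample space. The following Python sketch assumes a standard 52-card deck, two cards drawn without replacement, and all ordered pairs equally likely; the card encoding and function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]  # 52 cards
ACE_SPADES, EIGHT_HEARTS = (1, "S"), (8, "H")

# Sample space: ordered pairs of distinct cards, all equally likely.
draws = list(permutations(deck, 2))

def prob(event):
    """Probability of an event, given as a predicate on an ordered draw."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

def conditional(b, a):
    """P(B | A) = P(A and B) / P(A), defined only when P(A) > 0."""
    p_a = prob(a)
    if p_a == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda d: a(d) and b(d)) / p_a

second_is_ace = lambda d: d[1] == ACE_SPADES
p1 = conditional(second_is_ace, lambda d: d[0] == ACE_SPADES)    # situation 1
p2 = conditional(second_is_ace, lambda d: d[0] == EIGHT_HEARTS)  # situation 2
p3 = prob(second_is_ace)                                         # situation 3
```

The three answers are 0, 1/51 and 1/52 respectively, so the information about the first card does change the probability assigned to the second.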

In the case of discrete random variables, it is easy to define $P(X = x \mid Y = y)$ if
$P(Y = y)$ is not zero. In the case of continuous random variables, we always have that
$P(Y = y) = 0$, and conditional probability (as well as conditional expectation, to be
defined below) raises measure theoretic problems. These are treated rigorously using the
Radon-Nikodym theorem (e.g., Chung, 1974, Chapter 9). It is not necessary to discuss
these technical issues here, as long as we use certain basic properties.

If $X$ and $Y$ are random variables with a joint density $f_{X,Y}(x,y)$, we will define the
conditional density of $X$ given $Y$, $f_{X|Y}(x|y)$, in terms of which we can compute conditional
probabilities and conditional expectations.

Definition 2.5.12 (Conditional Density) Let $X$ and $Y$ be continuous random variables,
with marginal probability densities $f_X(x)$ and $f_Y(y)$, and joint density $f_{X,Y}(x,y)$.
Then, the conditional density of the random variable $X$ given that the random variable
$Y$ equals $y$ is

$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)},$$

provided that $f_Y(y) \ne 0$.

An expectation with respect to a conditional distribution is called a conditional
expectation. We assume that $(X, Y)$ is continuous, so that $f_{X|Y}(x|y)$ exists.

Definition 2.5.13 (Conditional Expectation) Let $(X, Y)$ be continuous random variables,
and assume that the conditional density $f_{X|Y}(x|y)$ exists. Then the conditional
expectation of $X$ given that $Y = y$ is defined to be

$$E(X \mid y) = E(X \mid Y = y) = \int x f_{X|Y}(x|y)\,dx,$$

provided that this integral exists.

We write the random variable $E(X \mid Y)$ by substituting $Y$ for $y$ in the right hand side of
the defining equation.

More generally, we have the following properties of $E(X \mid Y)$. Let $X_1$, $X_2$ and $Y$ be
random variables, let $g_1$ and $g_2$ be functions such that $E(|g_1(X_1)|) < \infty$ and $E(|g_2(X_2)|) < \infty$,
and let $\alpha$ and $\beta$ be constants. Then

$$E[\alpha g_1(X_1) + \beta g_2(X_2) \mid Y] = \alpha E[g_1(X_1) \mid Y] + \beta E[g_2(X_2) \mid Y],$$ (2.62)

$$E\{E[g_1(X_1) \mid Y]\} = E[g_1(X_1)],$$ (2.63)

$$E[g_2(Y)\,g_1(X_1) \mid Y] = g_2(Y)\,E[g_1(X_1) \mid Y],$$ (2.64)

and also

$$P(A \mid Y) = E(I_A \mid Y),$$ (2.65)

where $I_A$ is the indicator random variable corresponding to the event $A$, defined by

$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A. \end{cases}$$ (2.66)
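
Property (2.63) can be illustrated in the discrete analogue of Definition 2.5.13, where integrals become finite sums. The joint probability table below is hypothetical, chosen only so that the arithmetic is easy to follow, and the function names are illustrative.

```python
from fractions import Fraction

# Hypothetical joint probability table P(X = x, Y = y) on a small grid.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}

def marginal_y(y):
    """P(Y = y), obtained by summing the joint table over x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_expectation(g, y):
    """E[g(X) | Y = y], the discrete analogue of Definition 2.5.13."""
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / marginal_y(y)

g = lambda x: x * x
ys = sorted({y for (_, y) in joint})

# (2.63): averaging E[g(X) | Y] over the distribution of Y recovers E[g(X)].
tower = sum(cond_expectation(g, y) * marginal_y(y) for y in ys)
direct = sum(g(x) * p for (x, _), p in joint.items())
```

Here `cond_expectation(g, 0)` is 3/4 and `cond_expectation(g, 1)` is 2, and weighting these by the probabilities of $Y = 0$ and $Y = 1$ reproduces the unconditional expectation 11/8.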


If two events are such that $P(A \mid B) = P(A)$, or equivalently $P(A \cap B) = P(A)P(B)$,
then the events $A$ and $B$ are said to be independent. More generally, we have the following
definition:

Definition 2.5.14 (Independent Events) Let $B = \{B_\alpha, \alpha \in I\}$ be a set of events.
These events are said to be independent if for every positive integer $n$ and every $n$
distinct elements $\alpha_1, \ldots, \alpha_n$ in the indexing set $I$, we have that

$$P(B_{\alpha_1} \cap \ldots \cap B_{\alpha_n}) = \prod_{i=1}^{n} P(B_{\alpha_i}).$$

If all events involving $X$ are independent of those involving $Y$, i.e. $\{X \in A\}$ and $\{Y \in B\}$
are independent for all sets $A$ and $B$, then $X$ and $Y$ are said to be independent random
variables. In this case, $F_{X,Y}(x,y) = F_X(x)F_Y(y)$, and, if $X$ and $Y$ are continuous,
$f_{X,Y}(x,y) = f_X(x)f_Y(y)$, and $f_{X|Y}(x|y) = f_X(x)$. More generally, we have that:

Definition 2.5.15 (Independent Random Variables) Let $\{X_\alpha, \alpha \in I\}$ be a family
of random variables. These random variables are said to be independent if, for every
positive integer $n$ and every $n$ distinct elements $\alpha_1, \ldots, \alpha_n$ in the indexing set $I$, we have
that

$$F_{X_{\alpha_1}, \ldots, X_{\alpha_n}}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_{\alpha_i}}(x_i).$$

If $X$ and $Y$ are independent random variables, then, for any functions $h_1(x)$ and $h_2(y)$
for which $E[|h_1(X)|] < \infty$ and $E[|h_2(Y)|] < \infty$,

$$E[h_1(X)h_2(Y)] = E[h_1(X)]\,E[h_2(Y)],$$ (2.67)

and

$$E(X \mid Y) = E(X).$$ (2.68)

We  also  will  be  making  use  of  the  following  results  for  independent  random  variables: 

$$P[X \le h(Y)] = E\{F_X[h(Y)]\} = \int F_X[h(y)]\,f_Y(y)\,dy,$$ (2.69)

and  the  variance  of  X  +  Y  is  the  sum  of  the  variances  of  X  and  Y. 
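
Result (2.69) lends itself to a direct numerical check. The sketch below assumes, purely for illustration, that $X$ and $Y$ are independent standard normal random variables, and evaluates the integral on the right of (2.69) by the trapezoidal rule; the function names and the integration limits are our own choices.

```python
import math

def Phi(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def prob_x_le_h_of_y(h, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal approximation of the integral of F_X[h(y)] f_Y(y) dy."""
    step = (hi - lo) / steps
    f = lambda y: Phi(h(y)) * phi(y)
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * step)
    return total * step
```

With $h(y) = y$ the result should be 1/2 by symmetry. With $h(y) = y + 1$ it should equal $\Phi(1/\sqrt{2})$, since $X - Y$ is normal with mean zero and, by the additivity of variances just stated, variance 2.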


2.5.5  Some  Distribution  Theory  for  Statistics 

We will make extensive use of several special continuous probability distributions of
importance to statistics. In this section, we define those probability distributions which we
will use in this thesis, and we state some important properties and relations.

All of these distributions are related, directly or indirectly, to the standard normal
distribution, denoted $\Phi(x)$ and defined by

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt.$$ (2.70)

The corresponding standard normal density is

$$\phi(x) = \frac{d\Phi(x)}{dx} = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$ (2.71)

If $X$ has a standard normal distribution, we indicate this by $X \sim N(0, 1)$, where ‘$\sim$’ is
read ‘is distributed as’ and the arguments of $N$ indicate that $X$ has a mean of zero and a
variance of one.
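The six densities which we will use are defined below, where we adopt the convention of
separating, by a semicolon, parameters which define special cases of a class of distributions
from the possible value $x$ of the random variable.

$$f_1(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} = \frac{1}{\sigma}\,\phi\!\left(\frac{x-\mu}{\sigma}\right)$$ (2.72)

$$f_2(x; \nu) = \frac{x^{\nu/2-1}\,e^{-x/2}}{\Gamma(\nu/2)\,2^{\nu/2}}$$ (2.73)

$$f_3(x; \lambda_1, \lambda_2) = \frac{\Gamma(\lambda_1 + \lambda_2)}{\Gamma(\lambda_1)\Gamma(\lambda_2)}\,x^{\lambda_1-1}(1-x)^{\lambda_2-1}$$ (2.74)

$$f_4(x; \nu_1, \nu_2) = \frac{\Gamma[(\nu_1+\nu_2)/2]}{\Gamma(\nu_1/2)\Gamma(\nu_2/2)}\left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}\frac{x^{\nu_1/2-1}}{[1+(\nu_1/\nu_2)x]^{(\nu_1+\nu_2)/2}}$$ (2.75)

$$f_5(x; \nu) \equiv \frac{\Gamma[(\nu+1)/2]}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$ (2.76)

$$f_6(x; \nu, \delta) \equiv \left[\Gamma(\nu/2)\,2^{\nu/2-1}\sqrt{\nu}\right]^{-1}\int_0^{\infty} u^{\nu}\,e^{-u^2/2}\,\phi\!\left(\frac{xu}{\sqrt{\nu}}-\delta\right)du$$ (2.77)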

The following are listed below for each of these densities:

•  Notation for the corresponding distribution,

•  The interval over which the density is nonzero (the support), and

•  The mean and variance, if necessary:

1. $f_1$ is the normal density, with distribution denoted $N(\mu, \sigma^2)$, with support the real
line, and with mean $\mu$ and variance $\sigma^2$;

2. $f_2$ is the $\chi^2$ density with $\nu$ degrees of freedom, with distribution denoted $\chi^2_\nu$, with
support the positive reals, and with mean $\nu$ and variance $2\nu$;

3. $f_3$ is the Beta density, with distribution denoted Beta$(\lambda_1, \lambda_2)$, with support $[0, 1]$,
and with mean $\lambda_1/(\lambda_1 + \lambda_2)$;

4. $f_4$ is the $F$ density with $\nu_1$ and $\nu_2$ degrees of freedom, with distribution denoted
$F_{\nu_1,\nu_2}$, and with support the positive reals;

5. $f_5$ is the $t$ density with $\nu$ degrees of freedom, with distribution denoted $T_\nu$, with
support the real line, and with mean zero;

6. $f_6$ is the noncentral $t$ density with $\nu$ degrees of freedom and noncentrality parameter
$\delta$, with distribution function denoted $T_\nu(\delta)$, and with support the real line.

In addition, we note that if $Z \sim N(\mu, \sigma^2)$, then if $n$ is an integer,

$$E[(Z - \mu)^n] = 0$$ (2.78)

for $n$ odd, and

$$E[(Z - \mu)^n] = \frac{n!\,\sigma^n}{(n/2)!\,2^{n/2}}$$ (2.79)

for $n$ even.
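
The moment formulas (2.78) and (2.79) are easy to evaluate directly. The following short Python sketch (the function name is ours) returns the $n$th central moment of a normal random variable with standard deviation `sigma`.

```python
import math

def normal_central_moment(n, sigma):
    """E[(Z - mu)^n] for Z ~ N(mu, sigma^2), following (2.78) and (2.79)."""
    if n < 0:
        raise ValueError("n must be a nonnegative integer")
    if n % 2 == 1:
        return 0.0                      # odd central moments vanish, (2.78)
    half = n // 2
    return math.factorial(n) * sigma ** n / (math.factorial(half) * 2 ** half)
```

For $n = 2$ this reduces to the variance $\sigma^2$, and for $n = 4$ to the familiar value $3\sigma^4$.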

Let $\{X_i\}_{i=1}^{n}$ denote a sequence of random variables. We call such a sequence a random
sample. If the $X_i$ are independent and identically distributed, we use the notation iid.
The sample mean and variance are defined as follows:

Definition 2.5.16 (Mean and Variance of a Sample) Let $\{X_i\}_{i=1}^{n}$ be a random sample.
The sample mean and sample variance are

$$\bar{X} = \sum_{i=1}^{n} X_i / n$$ (2.80)

and

$$S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1),$$ (2.81)

respectively.
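
In code, Definition 2.5.16 amounts to the following minimal Python sketch. The function name and the sample values used in the check are illustrative; note the divisor $n - 1$ in the sample variance, which requires at least two observations.

```python
import math

def sample_mean_and_variance(xs):
    """Return (X-bar, S^2) as in (2.80) and (2.81)."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    xbar = math.fsum(xs) / n
    s2 = math.fsum((x - xbar) ** 2 for x in xs) / (n - 1)
    return xbar, s2

xbar, s2 = sample_mean_and_variance([2.0, 4.0, 4.0, 5.0, 7.0])
```

For this small sample the mean is 4.4 and the sample variance is 13.2/4 = 3.3.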

If the $\{X_i\}$ are iid normally distributed, then the following important result holds:

Theorem 2.5.1 (Distribution of the Mean and Variance of a Normal Sample)
Assume $\{X_i\}_{i=1}^{n}$ are iid $N(\mu, \sigma^2)$. The sample mean and variance, $\bar{X}$ and $S^2$ respectively,
are independent,

$$\bar{X} \sim N(\mu, \sigma^2/n),$$

and

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

The  following  lemmas  relate  some  of  the  random  variables  whose  distributions  were 
defined  above.  These  results  are  important,  and  the  proofs  are  omitted  here.  In  a  more 
leisurely  presentation,  many  of  these  ‘results’  would  be  used  as  defining  the  corresponding 
random  variables,  and  the  distributions  would  be  derived  from  those  definitions. 

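Lemma 2.5.1 (Sums of Normal Random Variables) If $X \sim N(\mu_1, \sigma_1^2)$ and
$Y \sim N(\mu_2, \sigma_2^2)$, where $X$ and $Y$ are independent, and $a, b \in \mathbb{R}$ are arbitrary constants, then

$$aX + bY \sim N(a\mu_1 + b\mu_2,\; a^2\sigma_1^2 + b^2\sigma_2^2).$$

Also,

$$aX + b \sim N(a\mu_1 + b,\; a^2\sigma_1^2).$$

Lemma 2.5.2 (Sums of Squares of Normal Random Variables) If $\{X_i\}_{i=1}^{n}$ are iid
$N(0, 1)$, then

$$\sum_{i=1}^{n} X_i^2 \sim \chi^2_n.$$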

Lemma 2.5.3 (Sums and Quotients of $\chi^2$ Random Variables) Let $X \sim \chi^2_{\nu_1}$ and
$Y \sim \chi^2_{\nu_2}$, where $X$ and $Y$ are independent. Define the three random variables $Z_1 = X + Y$,
$Z_2 = X/(X + Y)$, and $Z_3 = (X/\nu_1)/(Y/\nu_2)$. Then

1. $Z_1$ and $Z_2$ are independent,

2. $Z_1 \sim \chi^2_{\nu_1+\nu_2}$,

3. $Z_2 \sim$ Beta$(\nu_1/2, \nu_2/2)$, and

4. $Z_3 \sim F_{\nu_1,\nu_2}$.


Lemma 2.5.4 (Student’s t Distribution) If $Z \sim N(0, 1)$ and $Y \sim \chi^2_\nu$, where $Z$ and
$Y$ are independent, then

$$\frac{Z + \delta}{\sqrt{Y/\nu}} \sim T_\nu(\delta),$$

and

$$\frac{Z}{\sqrt{Y/\nu}} \sim T_\nu.$$


It  is  customary  to  use  capitals  for  random  variables  and  Greek  letters  for  parameter  values. 
There  are  exceptions  often  due  to  ancient  conventions. 


2.5.6  Some  Limit  Theorems 

Two important limit theorems concerning the behavior of the average $\bar{X}$ of $n$ iid random
variables $X_1, X_2, \ldots, X_n$ are the Law of Large Numbers and the Central Limit Theorem.
In order to state these, we need to define two forms of convergence for a sequence of
random variables.

Definition 2.5.17 (Convergence in Probability) Let $\{X_n\}$ be a sequence of random
variables. We say that $\{X_n\}$ converges in probability to $X$ if for every $\epsilon > 0$

$$P(|X_n - X| > \epsilon) \to 0$$

as $n \to \infty$, and we write $X_n \xrightarrow{p} X$. $X$ can be either a constant or a random variable.

Definition 2.5.18 (Convergence in Distribution) If $X$ is a random variable with
distribution $F_X(x)$, and if $\{X_n\}$ is a sequence of random variables with distributions
$\{F_{X_n}(x_n)\}$, then we say that $X_n$ converges in distribution to $X$, and we write $F_{X_n} \to F_X$,
if for all points of continuity $x$ of $F_X(x)$

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x).$$

The  Law  of  Large  Numbers  states  that  the  average  of  iid  random  variables  with  finite 
mean  converges  in  probability  to  that  mean;  in  other  words,  that  expectation  has  been 
properly  defined  to  model  long  run  averages. 

Theorem 2.5.2 (Law of Large Numbers) Let $\{X_i\}$ be a sequence of iid random variables
with mean $\mu$, and let $\bar{X}_n$ be given by

$$\bar{X}_n = \sum_{i=1}^{n} X_i / n.$$

Then $\bar{X}_n \xrightarrow{p} \mu$.

If the variance is finite, then the Central Limit Theorem tells us more, i.e. that $\bar{X}_n$ is
approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$.

Theorem 2.5.3 (Central Limit Theorem) Let $\{X_i\}$ be a sequence of iid random variables
with mean $\mu$ and variance $\sigma^2 < \infty$. Then

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{D} N(0, 1).$$
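
A small simulation can make the Central Limit Theorem concrete. The sketch below is only an illustration under stated assumptions: the observations are taken to be iid Uniform(0, 1) (mean 1/2, variance 1/12), the sample size, number of replications, and seed are arbitrary, and the function names are ours. It estimates how often the standardized mean falls in $(-1.96, 1.96)$, which should be close to .95 if the normal approximation is good.

```python
import math
import random

def standardized_mean(sample, mu, sigma):
    """(X-bar - mu) / (sigma / sqrt(n)) for one sample."""
    n = len(sample)
    xbar = sum(sample) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

def coverage(n=50, reps=4000, seed=0):
    """Fraction of standardized means of Uniform(0, 1) samples inside (-1.96, 1.96)."""
    rng = random.Random(seed)            # fixed seed so the run is reproducible
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    hits = 0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        if abs(standardized_mean(sample, mu, sigma)) < 1.96:
            hits += 1
    return hits / reps
```

The simulated fraction is subject to Monte Carlo error of roughly .003 with these settings, so agreement with .95 should only be expected to about two decimal places.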

We can now immediately derive several important results involving some of the
distributions introduced in the previous subsection:

Lemma 2.5.5 Let $A_n \sim \chi^2_n$, and let $T_n$ and $T_n(\delta)$ denote Student $t$ random variables.
Then,

$$A_n = n + \sqrt{2n}\,U_n$$

where $U_n \to N(0, 1)$ as $n \to \infty$. Also, as $n \to \infty$, we have the following:

1. $A_n/n \xrightarrow{p} 1$,

2. $T_n \xrightarrow{D} N(0, 1)$, and

3. $T_n(\delta) \xrightarrow{D} N(\delta, 1)$.

2.6  A  Decision-Theoretic  Approach  to  Estimation  and  Hypothesis  Testing 

Statistical  decision  theory,  a  theory  of  decision  making  in  the  presence  of  uncertainty, 
extends  and  unifies  much  of  classical  statistical  inference.  Statistical  decision  theory  was 
first  studied  extensively  by  Abraham  Wald  in  the  1940’s.  Two  useful  texts  which  were 
consulted  in  the  preparation  of  this  section  are  Chernoff  and  Moses  (1957)  and  Berger 
(1985). 

In  the  present  section,  after  introducing  some  of  the  ideas  of  statistical  decision  theory 
we  show  how  the  classical  statistical  problems  of  estimation  and  hypothesis  testing ,  which 
will  concern  us  in  this  thesis,  can  be  regarded  as  special  cases  of  this  general  theory. 
Finally,  we  will  illustrate  each  of  these  two  classes  of  problems  with  an  example. 

2.6.1  Decision-Making  Under  Uncertainty 

A simple decision-making problem under uncertainty can be modeled as follows. Given
a set $\mathcal{A}$ of possible actions $a \in \mathcal{A}$, a choice of action, or decision, has to be made. The
consequence of this decision depends on the unknown state of nature $\theta \in \Theta$. Thus, for
each action $a$ and state $\theta$, there is a consequence (which may depend in part on chance).

For  any  individual  whose  preferences  satisfy  some  modest  assumptions,  consequences 
can  be  represented  by  a  real  valued  utility  measure  which  has  the  following  properties 
(e.g.,  Chernoff  and  Moses,  1957,  Chapter  4): 

1. The higher utility goes to the preferred consequence.

2. If the consequences have random components, then the utility for a random situation
can be evaluated as the mathematical expectation of the corresponding utilities, even
though we are not involved in a long run average situation.

Because statisticians prefer to measure how much they lose because of ignorance, it is
conventional to use losses in place of utilities, where we can define loss as negative utility.
Thus the consequences can be represented by a loss function $L(\theta, a)$. Now we are in the
position of having a game of a statistician with nature. Nature picks $\theta \in \Theta$, and the
statistician in ignorance of $\theta$ picks $a \in \mathcal{A}$. The game (in normal form) is represented
by $L(\theta, a)$. By performing an experiment, the statistician has an opportunity to obtain
information about the state of nature. Unfortunately, most experiments are less than fully
informative; they do not tell us $\theta$, but rather they provide data in the form of a random
variable $X$ which takes on values in $\mathcal{X}$, the distribution of which depends on the state.
The help that we get from the data depends on the extent to which the distribution of
the data depends on $\theta$. Having observed the data, the statistician must incorporate that
information in his decision making. He does so by selecting his action as a function of
$X$. Thus we have the decision function $\delta : \mathcal{X} \to \mathcal{A}$, or $\delta(X) = A$, where the resulting
action $A$ is ordinarily random, since it depends on the data $X$. Occasionally we will
use the terminology of game theory and refer to a decision function as a strategy. The
consequence of using $\delta$ when the state of nature is $\theta$ is measured by the expected loss as
a function of $\theta$, called the risk

$$R(\theta, \delta) = E_\theta[L(\theta, A)] = E_\theta[L(\theta, \delta(X))],$$ (2.82)

where the subscript represents expectation with respect to the distribution of $X$, when $\theta$
is the state of nature.

By  introducing  the  experiment  we  have  changed  our  relatively  simple  problem  into  a 
more  complicated  looking  problem  of  the  same  form:  where  the  statistician  chooses  the 
decision  function  while  nature  still  chooses  the  state.  However,  we  have  lost  nothing  and 
possibly  gained  something,  because  among  our  decision  functions  are  those  which  ignore 
the  data.  Typically,  with  informative  experiments,  we  can  do  better  than  before. 

2.6.2  Admissibility  and  Bayes  Risk 

Let $X$ be a random variable, with distribution function $F_X(x \mid \theta)$. On the basis of $X$,
we choose an action by means of the decision function $\delta : \mathcal{X} \to \mathcal{A}$. We would like to
choose a $\delta$ which makes $R(\theta, \delta)$ small for all $\theta \in \Theta$. Of course, we may have two decision
functions $\delta_1$ and $\delta_2$, for which $R(\theta, \delta_1) < R(\theta, \delta_2)$ for some values of $\theta$, but for which
$R(\theta, \delta_1) > R(\theta, \delta_2)$ for some other values of $\theta$. In this case, we cannot say which of $\delta_1$
and $\delta_2$ is preferable on the basis of $R(\theta, \delta)$ alone. However, if $R(\theta, \delta_1) < R(\theta, \delta_2)$ for all
$\theta$, then $\delta_1$ is clearly preferable. A decision function $\delta_*$ dominates a decision function $\delta$
if $R(\theta, \delta_*) \le R(\theta, \delta)$ for all $\theta$ and $R(\theta, \delta_*) < R(\theta, \delta)$ for some $\theta$. A decision function is
inadmissible if it is dominated by some other strategy, and admissible otherwise.
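
The definition of dominance can be sketched in Python as a comparison of two risk functions. The example is hypothetical: we assume a single observation $X \sim N(\theta, 1)$ and squared-error loss, for which the three decision functions below have the risks shown in the comments, and we compare risks only on a finite grid of $\theta$ values, so the check is indicative rather than a proof.

```python
def dominates(risk_star, risk, thetas):
    """True if risk_star <= risk at every theta in the grid and < at some theta."""
    pairs = [(risk_star(t), risk(t)) for t in thetas]
    return all(rs <= r for rs, r in pairs) and any(rs < r for rs, r in pairs)

# One observation X ~ N(theta, 1) with squared-error loss (illustrative setup).
risk_x = lambda theta: 1.0               # delta(X) = X: variance 1, no bias
risk_x_plus_1 = lambda theta: 2.0        # delta(X) = X + 1: variance 1, bias 1
risk_zero = lambda theta: theta ** 2     # delta(X) = 0: ignores the data

thetas = [k / 10.0 for k in range(-30, 31)]
```

Here $\delta(X) = X$ dominates $\delta(X) = X + 1$, so the latter is inadmissible. Neither $\delta(X) = X$ nor the constant guess $\delta(X) = 0$ dominates the other, since their risk functions cross at $|\theta| = 1$.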

It is natural for an optimizer to insist that we select only admissible strategies, but
that rarely solves the dilemma of how to select a decision function. Occasionally, however,
we do have a situation where a certain type of problem recurs frequently, and through
past experience we learn that the $\theta$ values behave like random variables with a known
probability distribution $\Pi(\theta)$; for simplicity of presentation we will take $\theta$ to be continuous,
with density $\pi(\theta)$. In those cases we can select our decision function by minimizing the
Bayes risk

$$r(\delta) = E\{E_\theta[L(\theta, \delta)]\} = \int E_\theta[L(\theta, \delta)]\,\pi(\theta)\,d\theta.$$ (2.83)

A decision function which minimizes the Bayes risk is called a Bayes strategy.

Two facts concerning Bayes strategies are of particular importance. The first of these is
a theorem which states that under suitable regularity conditions every admissible strategy
is a Bayes strategy or a limit of Bayes strategies. The second asserts that it is often
relatively easy to find a Bayes strategy by using the data $X$ to replace the prior distribution
$\pi$ by a posterior distribution

$$\pi^*(\theta) = \frac{\pi(\theta)\,f_X(X; \theta)}{\int f_X(X; \theta)\,\pi(\theta)\,d\theta},$$ (2.84)

where $f_X(x; \theta)$ is the density of $X$. Then we select $A = \delta(X)$ as a value $a$ which minimizes
the posterior risk, conditional on the data $X$,

$$E^*[L(\theta, a)] = \int L(\theta, a)\,\pi^*(\theta)\,d\theta,$$ (2.85)

where $\theta$ is a random variable with posterior distribution $\pi^*$. Here $X$ is present implicitly,
since $\pi^*(\theta)$ depends on $X$.
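
The two steps just described, forming the posterior and then minimizing the posterior risk, can be sketched numerically. Everything specific in the following Python example is an assumption made for illustration: a binomial observation with $n = 10$ and $x = 7$, a uniform prior on $(0, 1)$, squared-error loss, a grid approximation to the integrals, and a grid search over candidate actions.

```python
import math

def posterior_on_grid(prior, likelihood, x, grid):
    """Posterior density values on an equally spaced grid of theta values."""
    h = grid[1] - grid[0]
    unnorm = [prior(t) * likelihood(x, t) for t in grid]
    # Trapezoidal approximation of the normalizing integral in the denominator.
    norm = h * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    return [u / norm for u in unnorm]

def posterior_risk(a, loss, post, grid):
    """Trapezoidal approximation of the posterior expected loss of action a."""
    h = grid[1] - grid[0]
    vals = [loss(t, a) * p for t, p in zip(grid, post)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

n, x = 10, 7
grid = [k / 1000.0 for k in range(1001)]
prior = lambda t: 1.0                                    # uniform prior (assumed)
binom = lambda x, t: math.comb(n, x) * t**x * (1 - t)**(n - x)
post = posterior_on_grid(prior, binom, x, grid)

sq_loss = lambda t, a: (a - t) ** 2
actions = [k / 100.0 for k in range(101)]
best = min(actions, key=lambda a: posterior_risk(a, sq_loss, post, grid))
```

With these choices the posterior is a Beta(8, 4) distribution with mean 2/3, and the grid search returns the candidate action nearest to that mean, 0.67.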

2.6.3  Philosophies  of  Inference 

The data $X$ may have reduced uncertainty due to the unknown state of nature, but it
is seldom the case that there exists a decision function $\delta(X)$ which dominates all others.
There are two primary (and several secondary) schools of thought on how to select ‘good’
decision functions when such a selection cannot be done on the basis of $R(\theta, \delta)$ alone: the
frequentist and Bayesian philosophies of inference.

The  term  ‘frequentist’  is  a  misnomer.  It  suggests  the  use  of  long  run  average  which 
is  not  relevant.  A  distinction  between  the  two  schools  is,  rather,  that  the  frequentist 
tries  to  be  objective  while  the  Bayesian  is  subjective.  There  is  a  theorem,  very  much  like 
the  theorem  that  gives  rise  to  utility,  that  states  that  if  a  decision  maker  acts  coherently 
on  related  problems,  he  must  be  acting  as  though  he  has  a  prior  probability  (Ferguson, 
1967,  pp.  17-22).  Using  conditional  probability,  we  can  show  how  this  prior  changes  with 
additional  information,  but  this  theorem  does  not  say  where  the  prior  comes  from.  A 
weakness  of  the  Bayesian  philosophy  is  that  when  we  replace  our  vague  feelings  about 
the  prior  by  some  approximation,  that  approximation  may  carry  more  information  than 
we  really  feel  we  have.  The  solution  based  on  the  approximation  may  be  far  from  an 
approximation  to  the  solution,  and  there  is  a  resulting  lack  of  robustness.  The  fact  that 
Bayesians  are  subjective  is  also  perceived  by  many  to  be  a  weakness.  On  the  other 
hand,  frequentists  try  to  find  a  procedure  which  will  not  do  poorly  no  matter  what  the 
true  state  of  nature.  A  shortcoming  of  this  approach  is  that  whatever  criterion  that  a 
frequentist  might  suggest,  it  will  either  be  equivalent  to  a  Bayesian  criterion,  or  else  it  will 
lead  to  paradoxes  because  of  the  theorem  on  coherent  decision  making.  If  the  criterion  is 
equivalent  to  a  Bayesian  one,  then  the  prior  is  likely  to  have  been  chosen  for  mathematical 
convenience,  and  it  might  not  be  a  reasonable  reflection  of  prior  experience. 
2.6.4  Estimation

In problems of statistical inference, functions of the observed data $X$ are usually called
statistics, and the state of nature $\theta$ is called a parameter in a parameter space $\Theta$. Two
broad classes of statistical problems are problems of estimation and hypothesis testing,
and we briefly consider estimation next.

A decision problem for which knowledge of $\theta$ would suggest that the best action to
take is $g(\theta)$ is called an estimation problem, and the corresponding decision function is
called an estimator. Typically, for such a problem the loss will depend on how close $a$
is to $g(\theta)$. Ordinarily a smooth loss function can then be approximated by squared-error

$$L(\theta, a) = (a - g(\theta))^2.$$ (2.86)

In those cases we want a decision function $\delta$ for which the mean square error

$$R(\theta, \delta) = E_\theta[(\delta(X) - g(\theta))^2]$$ (2.87)

is small. Let the expected value of an estimator $\delta(X)$ be denoted $\mu_\delta(\theta)$. Then the risk
$R(\theta, \delta)$ can be written as a sum of two terms

$$R(\theta, \delta) = E_\theta[(\delta(X) - \mu_\delta(\theta))^2] + [g(\theta) - \mu_\delta(\theta)]^2.$$ (2.88)

The first term in (2.88) is the variance of the estimator $\delta(X)$, and the second term is the
square of the bias of $\delta(X)$.
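
The decomposition (2.88) can be verified exactly in a small example. The setting below is hypothetical: $X$ is binomial with $n = 10$ trials and success probability $p = 3/10$, the quantity estimated is $g(p) = p$, and two estimators are compared, the sample proportion and a shrunken alternative. Exact rational arithmetic is used so that the identity holds without rounding error.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Probabilities P(X = k), k = 0..n, for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def risk_terms(estimator, n, p):
    """Return (mean square error, variance, squared bias) for estimating g(p) = p."""
    pmf = binomial_pmf(n, p)
    mean = sum(estimator(k) * w for k, w in enumerate(pmf))
    mse = sum((estimator(k) - p) ** 2 * w for k, w in enumerate(pmf))
    var = sum((estimator(k) - mean) ** 2 * w for k, w in enumerate(pmf))
    return mse, var, (p - mean) ** 2

n, p = 10, Fraction(3, 10)
unbiased = lambda k: Fraction(k, n)          # sample proportion
shrunk = lambda k: Fraction(k + 1, n + 2)    # hypothetical biased alternative
```

For the sample proportion the squared bias is zero and the risk equals the variance $p(1-p)/n$; for the shrunken estimator the variance is smaller but a squared bias of 1/900 is added.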

Admissibility does not do much to help reduce the class of available estimators in this
case. To see this, let $\delta_0 \equiv \theta_0$, for any value $\theta_0$ of $\theta$, and note that, for the loss (2.86)
with $g(\theta) = \theta$, $R(\theta_0, \delta_0) = 0$. Although $\delta_0$ makes no use of the data $X$, it is at least as good as any
decision function when the true parameter is $\theta_0$.

For the Bayesian, the Bayes strategy for squared-error loss would be the mean of the
posterior distribution of $g(\theta)$. A non-Bayesian can eliminate ridiculous strategies such
as the guess $\theta = \theta_0$ above by restricting the class of decision functions to be considered.
Often this is done by restricting consideration to unbiased estimators. An estimator $\delta$ is
an unbiased estimator of $g(\theta)$ if

$$E_\theta[\delta(X)] = g(\theta)$$ (2.89)

for all $\theta$.

Among unbiased estimators for a particular estimation problem, one can often determine
an estimator $\delta_u(X)$ which minimizes the risk (2.88). Since squared-error loss for an
unbiased estimator is the same as variance, we call such a $\delta_u(X)$ a minimum variance
unbiased estimator.

2.6.5  Hypothesis  Testing 

A decision problem with only two actions is called a hypothesis testing problem for reasons
that will become clear shortly. We can divide up the class $\Theta$ of states of nature into two
sets: one set, $\Theta_0$, for which one of the actions, say $a_0$, is the best action, and another set,
$\Theta_1 = \Theta - \Theta_0$, for which the other action, say $a_1$, is the best action. Thus, we can identify
$a_0$ with accepting the hypothesis

$$H_0 : \theta \in \Theta_0,$$ (2.90)

and $a_1$ with accepting the alternative hypothesis

$$H_1 : \theta \in \Theta_1.$$ (2.91)

Any decision function $\delta$ consists of dividing up the set $\mathcal{X}$ of possible observations into two
subsets: $U_0$ and $U_1 = \mathcal{X} - U_0$. Observations in $U_0$ lead to accepting $H_0$, and observations
in $U_1$ lead to accepting $H_1$. The risk $R(\theta, \delta)$ depends on both the cost of making the
wrong decision and on the probability of making the wrong decision, when $\theta$ is the state
of nature. For example, if we associate a loss of zero with a correct decision, then

$$R(\theta, \delta) = L(\theta, a_1)\,P_\theta(X \in U_1) \quad \text{for } \theta \in \Theta_0,$$ (2.92)

$$R(\theta, \delta) = L(\theta, a_0)\,P_\theta(X \in U_0) \quad \text{for } \theta \in \Theta_1.$$

Historically, the theory of hypothesis testing developed slowly in several stages, before
the introduction of decision theory. In the first stage of significance testing, the formulation
was incomplete and no attention was paid to the alternative hypothesis nor to the cost
of making the wrong decision. Typically one wished to establish that some treatment
had an effect. A null hypothesis $H_0$ would be formulated to state that the treatment
had no effect. (The action in the real world corresponding to rejecting the hypothesis
that there is no effect would be to continue research in that direction or to decide to
apply the treatment. Accepting the hypothesis would presumably lead to giving up on
the treatment.) A statistic $T$ would be introduced which would measure how inconsistent
the data are with the null hypothesis, and would lead to rejection if $T$ were large enough.

For example, assume that our experiment consists of $n$ iid observations $X_1, \ldots, X_n$
from a $N(\mu, \sigma^2)$ distribution where $\sigma^2$ is known, and that our null hypothesis is

$$H_0 : \mu = 0.$$ (2.93)

A reasonable statistic to use in assessing evidence against $H_0$ appears to be the absolute
value of the sample mean, $|\bar{X}|$, since $|\bar{X}|$ estimates $|\mu|$, and so large values of $|\bar{X}|$ suggest
that the data are inconsistent with $H_0$. We propose the test ‘reject $H_0$ if

$$T = |\bar{X}| > 1.96\,\sigma/\sqrt{n}$$’. (2.94)

The probability of rejecting the null hypothesis when the null hypothesis is true is called
the significance level or the size of a hypothesis test, and usually denoted $\alpha$. For our
example, the constant 1.96 was chosen so that $\alpha = .05$. It is important to choose a
significance  level  before  examining  the  data.  A  measure  of  the  consistency  of  the  data 
with  a  null  hypothesis  is  the  P-value,  which  is  the  smallest  significance  level  for  which  the 
null  hypothesis  can  be  rejected.  Thus,  a  test  statistic  which  would  yield  a  P-value  of  less 
than  .05  would  be  regarded  as  significant  at  the  .05  level  and  lead  to  rejection  if  a  .05  level 
test  were  used.  In  this  case  a  P-value  of  .0001  would  be  regarded  as  highly  significant, 
and  would  be  of  interest  to  the  statistician  who  isn’t  completely  bound  by  formalism,  but 
in  principle  it  would  lead  to  the  same  conclusion  as  a  P-value  of  .0499. 

When a test is of the form ‘reject $H_0$ if $T > k$’, $k$ is sometimes called a critical value.
Traditionally $k$ is a constant, although we will consider situations in which $k$ is a function
of the data, and we will approximate the functional form of this statistic $k$, in order to
achieve certain as yet unspecified aims, by attempting to solve an integral equation.

The above example problem becomes more complicated if, as is common in real
applications, $\sigma$ is unknown.

Then a particular test of the form ‘reject $H_0$ if $T > 1.645\,\sigma_0/\sqrt{n}$’, where $\sigma_0$ is some
constant, has the undesirable result that the probability of rejecting the hypothesis
depends on the nuisance parameter $\sigma$, which is not of major interest in itself. In fact the
probability of rejecting the hypothesis $H_0 : X \sim N(0, \sigma^2)$, with unknown positive $\sigma$,
varies from 0 to 1 as $\sigma$ varies over the interval $(0, \infty)$. This problem was resolved by W.
S. Gosset, using the pseudonym ‘Student’, who suggested the use of the test procedure:
‘reject $H_0$ if

$$T = \frac{|\bar{X}|}{S/\sqrt{n}} > k\text{’.} \qquad (2.95)$$

Here $k$ is a constant critical value, and the denominator is an estimate of $\sigma/\sqrt{n}$. The test
(2.95) resembles the previous test (2.94), with the known standard deviation replaced by
its estimate. When $H_0$ is true, the probability of falsely rejecting $H_0$ is determined from
Student’s $t$ distribution, and depends only on the choice of $k$ and $n - 1$; it is independent
of  the  nuisance  parameter.  Test  procedures  for  which  the  probability  of  rejection  when 
the  hypothesis  is  true  does  not  depend  on  the  nuisance  parameter  are  called  similar. 

A  later  stage  in  the  development  of  the  theory  of  hypothesis  testing  came  out  of  the 
realization  that  the  significance  theory  did  not  give  any  formal  suggestions  for  selecting 
one  test  statistic  over  another.  Neyman  and  Pearson  introduced  the  notion  of  alternative 
hypotheses.  They  formulated  the  problem  of  minimizing  the  probability  of  accepting  the 
hypothesis  when  it  is  false,  given  the  size  or  significance  level  of  the  test.  Then  the  above 
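problem could be stated as one where we observe iid observations which are $N(\mu, \sigma^2)$,
where $\theta = (\mu, \sigma)$ and it is desired to test

$$H_0 : \theta \in \Theta_0 = \{\theta : \mu = 0,\ 0 < \sigma < \infty\} \qquad (2.96)$$

against the alternative

$$H_1 : \theta \in \Theta_1 = \{\theta : \mu \ne 0,\ 0 < \sigma < \infty\}. \qquad (2.97)$$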


Here, one is interested in the power function which measures the probability of rejecting
the hypothesis for all possible values of $\theta$. In our example above the power function of
the t-test suggested depends only on $k$, $n$, and the noncentrality parameter $\delta = \sqrt{n}|\mu|/\sigma$.
To see this, note that (2.95) can be written as

$$T = \frac{|\bar{X}|}{S/\sqrt{n}} = \frac{|\sqrt{n}\,\bar{X}/\sigma|}{S/\sigma} = \frac{|Z + \delta|}{Y}, \qquad (2.98)$$

where $\delta = \sqrt{n}|\mu|/\sigma$, $Z \sim N(0,1)$, $(n-1)Y^2 \sim \chi^2_{n-1}$, and $Y^2$ (hence $Y$) is independent
of $Z$. Therefore, $T$ is distributed as the absolute value of a noncentral-t random variable,
with $(n - 1)$ degrees of freedom and noncentrality parameter $\delta$.

In general, for composite hypotheses, the size of a test is

$$\alpha = \sup_{\theta \in \Theta_0} P_\theta(\text{Reject } H_0). \qquad (2.99)$$

The Student t test, described in (2.95), can be shown to be optimal among size $\alpha$ tests
for which the power is symmetric in the parameter $\mu$.

This  theory  fails  to  give  formal  consideration  to  the  cost  of  incorrect  decisions,  but 
there  was  always  some  sort  of  informal  attention  paid  to  cost,  in  order  to  rationalize 
the  selection  of  good  significance  levels  of  the  test  procedures.  It  is  implicit  in  that  the 
Neyman-Pearson  theory  tends  to  treat  the  two  hypotheses  asymmetrically. 
2.6.6  Confidence  Intervals


So  far  the  theory  of  estimation,  as  expressed  above,  does  not  pay  much  attention  to  how 
reliable  the  estimates  are.  Ordinarily,  the  statistician  or  scientist  wants  to  know,  for 
his  real  decision  making,  which  depends  only  in  part  on  his  estimate,  how  reliable  this 
estimate is. Traditionally, one accompanies an estimate of $g(\theta)$ with an estimate of how
variable  that  estimate  is.  Philosophically  this  puts  us  in  a  problem  of  estimating  the 
variance  of  the  estimate  of  the  variance  of  the  ...  of  the  estimate.  That  problem  can  be 
resolved  by  the  use  of  confidence  intervals  or  regions. 

A confidence interval, or more generally, a confidence region, is a random set which
contains the true value of a (scalar or vector) parameter with at least a specified
probability, or confidence. Let $\mathcal{U}(X)$ be a subset of the parameter space $\Theta$ which depends on
the data $X$. If, for all $\theta \in \Theta$,

$$P_\theta[g(\theta) \in \mathcal{U}(X)] \ge \gamma, \qquad (2.100)$$

where the probability is determined from the distribution $F_X(x|\theta)$ of the data, then the
region $\mathcal{U}(X)$ is called a confidence region for $g(\theta)$ of confidence at least $\gamma$.

For example, if $X_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$, $\theta = (\mu, \sigma^2)$, and $\bar{X}$ and $S^2$ are the
sample mean and variance, then

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}, \qquad (2.101)$$

where $t_{n-1}$ denotes the Student-t distribution with $n - 1$ degrees of freedom, and hence,
for all $\mu$,

$$P_\theta\left[\bar{X} - t_{n-1}(\alpha/2)\,S/\sqrt{n} < \mu < \bar{X} + t_{n-1}(\alpha/2)\,S/\sqrt{n}\right] = \gamma, \qquad (2.102)$$

where $\alpha = 1 - \gamma$ and $P(T > t_{n-1}(\alpha/2)) = \alpha/2$. The random interval

$$\Gamma : \left(\bar{X} - t_{n-1}(\alpha/2)\,S/\sqrt{n},\ \bar{X} + t_{n-1}(\alpha/2)\,S/\sqrt{n}\right) \qquad (2.103)$$

contains $\mu$ with probability $\gamma$, and we say that $\Gamma$ is a $100\gamma\%$ confidence interval for $\mu$. To
be more specific, $\Gamma$ is a two-sided interval; we can also construct one-sided intervals if we
are interested only in a lower, or upper, confidence limit on $\mu$.

A hypothesis under which the parameter equals a specific point in the parameter
space is called a simple hypothesis; the complementary situation is called a composite
hypothesis. There is a one-to-one relationship between simple hypotheses and confidence
intervals: given a confidence interval of confidence $1 - \alpha$ for $\theta_0$, the test ‘reject $H_0 : \theta = \theta_0$
if $\theta_0$ is not in this confidence interval’ is a hypothesis test of size $\alpha$.

Actually, there can be a one-to-one relationship between confidence intervals and
hypothesis tests even when the null hypothesis is composite, and the confidence interval
(2.103) provides one such example. The interval (2.103) corresponds to the composite
null hypothesis

$$H_0 : \{(\mu, \sigma^2) : \mu = \mu_0,\ 0 < \sigma^2 < \infty\} \qquad (2.104)$$

together with the composite alternative

$$H_1 : \{(\mu, \sigma^2) : \mu \ne \mu_0,\ 0 < \sigma^2 < \infty\}. \qquad (2.105)$$

A test of $H_0$ with alternative $H_1$ of size $\alpha$ is provided by the criterion ‘reject $H_0$ if (2.103)
does not contain $\mu_0$’. The reason why the interval (2.103) corresponds to a hypothesis
test is that the relevant test statistic does not depend on the nuisance parameter $\sigma^2$, and
were it not for this parameter $H_0$ would be simple.
2.6.7  Examples  of  Integral  Equations  in  Estimation  and  Hypothesis
Testing

In this subsection, we provide examples of problems in unbiased estimation and hypothesis
testing which give rise to integral equations of the first kind. The first example is a problem
of unbiased estimation chosen because it is simple and because it illustrates an iterative
algorithm which we will discuss in later chapters. The hypothesis testing example provides
a preview of the Behrens-Fisher problem, to be presented in much more detail in Chapter
5.


Determining  an  Unbiased  Estimator 

Let $X$ be a random variable with probability density

$$f(x; \theta) = \begin{cases} e^{-(x - \theta)} & \text{for } x \ge \theta, \\ 0 & \text{for } x < \theta. \end{cases} \qquad (2.106)$$

We will determine an unbiased estimator of $\theta^2$, that is, a function $h(X)$ such that

$$E_\theta[h(X)] = \int_\theta^\infty h(x) f(x; \theta)\,dx = \theta^2. \qquad (2.107)$$

If such an estimator exists, it can be shown to be the unique minimum variance unbiased
estimator of $\theta^2$.

We will solve this problem by employing an iterative algorithm which is a special case
of the method to be considered in later chapters. Given an approximation $h^n(X)$ to $h(X)$,
we define $h^{n+1}(X)$ to be

$$h^{n+1}(x) = h^n(x) + \left[\theta^2 - \int_\theta^\infty h^n(y) f(y; \theta)\,dy\right]_{\theta = x}, \qquad (2.108)$$

where once the function of $\theta$ in the square brackets is calculated, $\theta$ is to be replaced with
$x$.

Let $h^0(x) = 0$. We can easily calculate the first two moments of $X$,

$$E_\theta(X) = \theta + 1, \qquad (2.109)$$

and

$$E_\theta(X^2) = \theta^2 + 2\theta + 2, \qquad (2.110)$$

and use these moments to show that

$$h^0(x) = 0, \quad h^1(x) = x^2, \quad h^2(x) = x^2 - 2x - 2, \quad \text{and} \quad h^n(x) = x^2 - 2x, \qquad (2.111)$$

for $n > 2$. The random variable $h(X) = X^2 - 2X$ is the unbiased estimator; the algorithm
converged to the exact solution in three iterations.
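
The iterates in (2.111) can be checked mechanically. The following sketch is an illustration only, not the implementation used in this thesis. It assumes that $h$ is a polynomial stored as a list of coefficients (lowest degree first) and uses the fact that $X - \theta$ is a standard exponential variable, so that $E[(X - \theta)^j] = j!$. The function names are hypothetical.

```python
from math import comb, factorial

def expect_monomial(k):
    """Coefficients in theta (lowest degree first) of E_theta[X^k], where X = theta + E, E ~ Exp(1)."""
    coeffs = [0] * (k + 1)
    for j in range(k + 1):
        # Binomial expansion of (theta + E)^k with E[E^j] = j!
        coeffs[k - j] += comb(k, j) * factorial(j)
    return coeffs

def expect_poly(h):
    """Coefficients in theta of E_theta[h(X)] for a polynomial h."""
    out = [0] * len(h)
    for k, c in enumerate(h):
        for p, m in enumerate(expect_monomial(k)):
            out[p] += c * m
    return out

def step(h, target):
    """One pass of (2.108): h(x) + [target(theta) - E_theta h(X)] evaluated at theta = x."""
    e = expect_poly(h)
    return [hi + ti - ei for hi, ti, ei in zip(h, target, e)]

target = [0, 0, 1]          # theta^2
h = [0, 0, 0]               # h^0 = 0
iterates = [h]
for _ in range(4):
    h = step(h, target)
    iterates.append(h)

for n, coeffs in enumerate(iterates):
    print(n, coeffs)        # coefficients of 1, x, x^2
```

The third iterate has coefficients of $x^2 - 2x$, and a further pass leaves it unchanged, since its expectation is exactly $\theta^2$.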
The  Behrens-Fisher  Problem

Let $X_{1i}$, $i = 1, \ldots, n_1$ and $X_{2i}$, $i = 1, \ldots, n_2$ denote random samples from normal
populations with means and variances $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively, and let the sample
means and variances be $\bar{X}_j$ and $S_j^2$, for $j = 1, 2$.

Consider the problem of testing the composite null hypothesis

$$H_0 : \mu_1 = \mu_2 \qquad (2.112)$$

against the alternative

$$H_1 : \mu_1 > \mu_2. \qquad (2.113)$$

If the variance ratio, $r = \sigma_1^2/\sigma_2^2$, is known then since

$$D = \bar{X}_1 - \bar{X}_2 \sim N(\mu_1 - \mu_2,\ \sigma_1^2/n_1 + \sigma_2^2/n_2), \qquad (2.114)$$

and $D$ is independent of the estimate of $\sigma_1^2 = r\sigma_2^2$,

$$\hat{\sigma}_1^2 = \frac{(n_1 - 1)S_1^2 + r(n_2 - 1)S_2^2}{n_1 + n_2 - 2}, \qquad (2.115)$$

which is proportional to a $\chi^2_{n_1 + n_2 - 2}$ random variable, we have that

$$\frac{D - (\mu_1 - \mu_2)}{\left[n_1^{-1} + (r n_2)^{-1}\right]^{1/2} \hat{\sigma}_1} \sim t_{n_1 + n_2 - 2}. \qquad (2.116)$$

Thus,  a  simple  extension  of  the  student-t  hypothesis  test,  discussed  in  Section  2.6.5  for 
a  single  sample  situation,  provides  an  effective  means  of  performing  hypothesis  tests  and 
obtaining  confidence  intervals  for  this  two-sample  case  where  r  is  known. 
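
As a small numerical illustration of the known-ratio case, the sketch below evaluates the statistic (2.116) from two samples. It is a hedged example: the function name `t_known_ratio`, the argument `delta0` (the hypothesized value of $\mu_1 - \mu_2$), and the sample values are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_known_ratio(x1, x2, r, delta0=0.0):
    """Statistic (2.116) when r = sigma_1^2 / sigma_2^2 is known.

    Returns (t, df); when mu_1 - mu_2 = delta0, t follows Student's t
    distribution with df = n1 + n2 - 2 degrees of freedom.
    """
    n1, n2 = len(x1), len(x2)
    d = mean(x1) - mean(x2)
    # Estimate (2.115) of sigma_1^2; variance() uses the n - 1 divisor.
    s2 = ((n1 - 1) * variance(x1) + r * (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (d - delta0) / sqrt((1.0 / n1 + 1.0 / (r * n2)) * s2)
    return t, n1 + n2 - 2

# Illustrative data: sample variances 1 and 4, with r assumed to be 1/4.
t, df = t_known_ratio([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], r=0.25)
print(t, df)
```

The resulting `t` would be compared with a critical value from the $t$ distribution with `df` degrees of freedom.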

The  situation  where  the  variance  ratio  is  unknown  is  usually  referred  to  as  the  Behrens- 
Fisher  problem.  This  problem,  the  main  topic  of  Chapter  5,  has  been  controversial  and 
important  to  the  theory  of  statistics. 

A natural test statistic to consider for the Behrens-Fisher problem is

$$U = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}, \qquad (2.117)$$

where the null hypothesis is rejected when $U$ is observed to be greater than a critical
value, which is a function of the data to be determined. We will allow the critical value
of this test statistic to depend, in an unspecified way, on the sample variances.

We would like the size of this hypothesis test to not depend on the nuisance parameter
$r$, or equivalently, on

$$\theta = \frac{\sigma_1^2/n_1}{\sigma_1^2/n_1 + \sigma_2^2/n_2}. \qquad (2.118)$$

A sample estimate of this parameter is the statistic

$$R = \frac{S_1^2/n_1}{S_1^2/n_1 + S_2^2/n_2}. \qquad (2.119)$$

We pose the following mathematical problem: determine a function, $d$, of the random
variable $R$ so that, given that the null hypothesis is true,

$$P(U > d(R) \mid \theta) = \alpha, \qquad (2.120)$$

for all $\theta$. A function $d(R)$ satisfying (2.120) would provide a critical value statistic for a
similar test of the hypothesis (2.112) against the alternative (2.113) of size $\alpha$.
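
The two statistics that enter (2.120) are simple to compute from data. The sketch below does so for a pair of small illustrative samples; the function name and the data are assumptions, and the critical value function $d$ itself is not computed here.

```python
from math import sqrt
from statistics import mean, variance

def behrens_fisher_statistics(x1, x2):
    """Return (U, R): the test statistic (2.117) and the sample estimate (2.119) of theta."""
    n1, n2 = len(x1), len(x2)
    a = variance(x1) / n1          # S_1^2 / n_1
    b = variance(x2) / n2          # S_2^2 / n_2
    u = (mean(x1) - mean(x2)) / sqrt(a + b)
    big_r = a / (a + b)            # always lies in [0, 1]
    return u, big_r

u, big_r = behrens_fisher_statistics([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(u, big_r)
```

A test of the form discussed above would reject the null hypothesis when `u` exceeds `d(big_r)` for a suitable function `d`.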

This function $d$, if indeed it exists, will be shown in Chapter 5 to be a solution of the
following nonlinear integral equation:

$$E_X\left[ T_{\nu_1 + \nu_2}\left( d(W)\,\sqrt{\nu_1 + \nu_2}\,\sqrt{\frac{X\theta}{\nu_1} + \frac{(1 - X)(1 - \theta)}{\nu_2}} \right) \right] = 1 - \alpha, \qquad (2.121)$$

where $\nu_j = n_j - 1$; the expectation is with respect to $X$, a Beta random variable with
parameters $\nu_1/2$ and $\nu_2/2$; $T_\nu(\cdot)$ denotes the t distribution with $\nu$ degrees of freedom; and
$W$ denotes the random variable

$$W = \frac{X\theta/\nu_1}{X\theta/\nu_1 + (1 - X)(1 - \theta)/\nu_2}. \qquad (2.122)$$

Chapter  3 

Richardson’s  Algorithm, 
Preconditioning,  and  Iterative 
Regularization 


3.1  Richardson’s  Algorithm 

Let $K : L_2[0,1] \to L_2[0,1]$ be compact. Consider the linear equation of the first kind

$$Kf = g, \qquad (3.1)$$

where $f$ and $g$ are in $L_2$, and $g$ is a known function. Define the iteration

$$f^{n+1} = f^n + \theta D(g - Kf^n), \quad \text{for } n = 0, 1, 2, \ldots, \qquad (3.2)$$

where $\theta$ is a positive constant, $f^0 \in L_2$ is arbitrary, and $D$ is a known invertible
linear operator with a bounded inverse. This is the iterative algorithm which will concern
us for most of this chapter. When $D = I$, the identity operator, (3.2) is Richardson’s
algorithm,

$$f^{n+1} = f^n + \theta(g - Kf^n), \quad \text{for } n = 0, 1, 2, \ldots, \qquad (3.3)$$

proposed by Richardson (1910) for the iterative solution of sparse linear systems. We
would like to choose $D$ to accelerate convergence to a vector $f$ such that $\|Kf - g\|$ is
sufficiently small, a practice known as preconditioning, where $D$ will be referred to as the
preconditioning operator.
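
To make the role of $D$ concrete, the following sketch applies (3.2) and (3.3) to a small matrix equation. It is only an illustration under stated assumptions: the $2 \times 2$ matrix, the step sizes, and the diagonal choice of $D$ (the inverse of the diagonal of $K$) are hypothetical, and this $D$ is not the Conditional Expectation preconditioner discussed later in this chapter.

```python
def matvec(a, x):
    """Matrix-vector product for lists of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def richardson(k, g, theta, n_iter, d=None, f0=None):
    """Iterate f <- f + theta * D (g - K f), i.e. (3.2); d=None means D = I, i.e. (3.3)."""
    f = list(f0) if f0 is not None else [0.0] * len(g)
    for _ in range(n_iter):
        resid = [gi - ki for gi, ki in zip(g, matvec(k, f))]
        direction = matvec(d, resid) if d is not None else resid
        f = [fi + theta * di for fi, di in zip(f, direction)]
    return f

def residual_norm(k, f, g):
    """Euclidean norm of g - K f."""
    return sum((gi - ki) ** 2 for gi, ki in zip(g, matvec(k, f))) ** 0.5

K = [[10.0, 1.0], [1.0, 1.0]]
g = [11.0, 2.0]                    # chosen so that the solution is f = (1, 1)
D = [[0.1, 0.0], [0.0, 1.0]]       # hypothetical diagonal preconditioner

plain = richardson(K, g, theta=0.18, n_iter=25)          # D = I
precond = richardson(K, g, theta=1.0, n_iter=25, d=D)    # with D
res_plain = residual_norm(K, plain, g)
res_precond = residual_norm(K, precond, g)
print(res_plain, res_precond)
```

In this example the unpreconditioned step size is limited by the largest eigenvalue of $K$, so the component associated with the small eigenvalue decays slowly, while the preconditioned iteration reduces the residual by many orders of magnitude in the same number of steps.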

The  plan  of  this  chapter  is  as  follows.  We  consider  first  the  behavior  of  (3.3)  in  various 
situations.  Except  for  a  literature  review,  we  do  not  consider  the  convergence  of  this 
algorithm to a solution in $L_2$. Since we are ultimately interested in solving discretizations
of  integral  equations  on  a  computer,  we  are  more  interested  in  singular,  and  possibly 
inconsistent,  matrix  equations  than  in  functional  equations,  and  we  are  more  interested  in 
how  close  this  algorithm  comes  to  a  smooth  near-solution  in  a  few  dozen  iterations  than 
in  ultimate  convergence  to  a  solution.  We  introduced  the  concept  of  a  near-solution  in 
Chapter  1.  Having  established  some  elementary  ideas  of  functional  analysis  in  Chapter 
2, we can now be more specific. We will say that a function $f$ is a near-solution to an
equation $Kf = g$ if

$$\|Kf - g\| < \tau \|g\|, \qquad (3.4)$$

where the constant $\tau$ is application dependent. A choice of $\tau$ corresponds to a decision
concerning what is considered to be a ‘small’ residual.

Next,  we  consider  the  choice  of  a  preconditioning  operator,  and  our  interest  shifts  to 
(3.2).  A  particular  choice  of  D  leads  to  the  Conditional  Expectation  algorithm,  which  is 
motivated  in  several  ways  and  illustrated  on  various  examples.  We  will  demonstrate  that 
the  Conditional  Expectation  algorithm  can  quickly  lead  to  near-solutions. 

If $K$ is an integral operator with a smooth kernel, then (3.3) will tend to produce
smooth  approximate  solutions.  There  is  regularization  implicit  in  the  iteration,  and, 
following  the  discussion  of  convergence  theory  and  preconditioning,  we  present  the  idea 
of  iterative  regularization. 

This chapter concludes with the discussion of examples; details of the numerical
implementation of the algorithms are provided in Appendix A.

3.1.1  Convergence  of  the  Richardson  and  Landweber  Algorithms  in  $L_2$

Proofs of the convergence of (3.3), under various conditions on the operator $K$, appear in
the literature, which we briefly review here. For more information, a good place to start
is Patterson (1974) and the references there.

If the operator $K$ is positive and compact, then it is necessarily self-adjoint and it
has a denumerable set of nonnegative eigenvalues. Moreover, $\|K\| = \lambda_1$, where $\lambda_1$ is the
largest eigenvalue of $K$. If $K$ is only assumed to be compact, then $K^*K$ is positive and
compact. It is not difficult to show (Patterson, 1974, p. 7) that if (3.1) is solvable, then
it has the same solutions as

$$K^*Kf = K^*g. \qquad (3.5)$$

Landweber (1951) considered the iteration

$$f^{n+1} = f^n + \theta K^*(g - Kf^n) \qquad (3.6)$$

where $0 < \theta < 2/\lambda_1$, and proved convergence for a Fredholm integral operator having a
continuous, real kernel. If a solution exists, then (3.6) converges to a solution, otherwise
this iteration converges to a function which minimizes $\|g - Kf\|$. If $K$ is positive and
compact, then Landweber has also (trivially) proved convergence of Richardson’s
algorithm (3.3) for $0 < \theta < 2/\lambda_1$, although he did not comment on this fact. Bialy (1959, see
also Patterson, 1974, pp. 33-41) generalized Landweber’s results to $K$ bounded, but not
necessarily compact.
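
The behavior of (3.6) for an equation with no solution can be seen in a small discrete analogue. The sketch below is an assumption-laden illustration: the matrix, the right hand side, and the step size are hypothetical, and the matrix transpose plays the role of $K^*$. The step size 0.4 is chosen conservatively so that it is below $2/\lambda_1$ whether $\lambda_1$ is read as the largest eigenvalue of $K^TK$ (here 2) or as $\|K\|$.

```python
def matvec(a, x):
    """Matrix-vector product for lists of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def landweber(k, g, theta, n_iter):
    """Iterate f <- f + theta * K^T (g - K f), a matrix version of (3.6), from f = 0."""
    kt = transpose(k)
    f = [0.0] * len(k[0])
    for _ in range(n_iter):
        resid = [gi - ki for gi, ki in zip(g, matvec(k, f))]
        f = [fi + theta * ci for fi, ci in zip(f, matvec(kt, resid))]
    return f

K = [[1.0, 0.0], [1.0, 0.0]]    # singular; its range is spanned by (1, 1)
g = [1.0, 3.0]                  # not in the range, so K f = g is inconsistent
f = landweber(K, g, theta=0.4, n_iter=60)
resid_norm = sum((gi - ki) ** 2 for gi, ki in zip(g, matvec(K, f))) ** 0.5
print(f, resid_norm)
```

The first component converges to 2, the least squares value, and the residual norm settles at $\sqrt{2}$, the distance from $g$ to the range of $K$; the second component, which $K$ cannot see, stays at its initial value.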

3.1.2  Richardson’s  Algorithm  for  Matrix  Equations 

Usually, the iterations of Richardson’s and Landweber’s algorithms cannot be performed
analytically. Instead one discretizes an integral equation in order to obtain an approximating
matrix equation. The discretization schemes used in this thesis for integral equations
of the first kind are described in Appendix A.

We  therefore  consider  in  this  section  Richardson’s  algorithm  (3.3)  applied  to  the  matrix 
equation  Kf  =  g ,  where  K  is  square  and  possibly  singular,  and  g  is  not  necessarily  in  the 
range  of  K;  i.e.  the  equation  might  be  inconsistent. 

We  are  interested  in  matrix  equations  which  are  approximations  to  ill-posed  integral 
equations,  so  situations  where  K  is  singular  and/or  the  matrix  equation  is  inconsistent  are 
particularly  important.  The  L2  convergence  theory  reviewed  in  the  previous  subsection 
is  of  little  use  here.  Indeed,  for  reasons  discussed  in  Section  2.4.2,  we  are  less  interested 
in ‘solving’ the equation than in finding a smooth ‘near-solution’: hence, an algorithm
which  ultimately  diverges  might  still  be  of  considerable  use  if  it  quickly  leads  to  such  a 
near-solution,  at  which  point  the  iteration  can  be  terminated. 


Some  Notation  and  a  Preliminary  Lemma 


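Let $K$ be an $m \times m$ matrix of rank $q < m$. The Jordan form of the matrix $K$ will be
written as

$$K = B \begin{bmatrix} J_{11} & O_{12} \\ O_{21} & N_{22} \end{bmatrix} B^{-1}, \qquad B = \begin{bmatrix} B^{\cdot 1} & B^{\cdot 2} \end{bmatrix}, \qquad B^{-1} = \begin{bmatrix} B_{1\cdot} \\ B_{2\cdot} \end{bmatrix}, \qquad (3.7)$$

with block dimensions $J_{11}$: $s \times s$; $O_{12}$: $s \times (m - s)$; $O_{21}$: $(m - s) \times s$; $N_{22}$: $(m - s) \times (m - s)$; $B^{\cdot 1}$: $m \times s$; $B^{\cdot 2}$: $m \times (m - s)$; $B_{1\cdot}$: $s \times m$; and $B_{2\cdot}$: $(m - s) \times m$,

where $J_{11}$ is a nonsingular matrix of Jordan blocks, and $N_{22}$ is a nilpotent matrix of
index $\iota \ge 1$ of Jordan blocks corresponding to a zero eigenvalue. The dimensions of the
submatrices are as indicated, and $s \le q$. Either the row or the column dimension of each
block in the partitioned matrices $B$ and $B^{-1}$ is equal to $m$; the dot indicates which. Also,
the use of superscripts and subscripts on these blocks is intended to aid in identifying, at
a glance, that a product such as $B^{\cdot 1} B_{1\cdot}$ is conformable.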

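We will also make use of the partitioned identity matrix

$$I_{m \times m} = \begin{bmatrix} I_{11} & O_{12} \\ O_{21} & I_{22} \end{bmatrix}, \qquad (3.8)$$

where $I_{11}$ is $s \times s$, $O_{12}$ is $s \times (m - s)$, $O_{21}$ is $(m - s) \times s$, and $I_{22}$ is $(m - s) \times (m - s)$.
Since $BB^{-1} = B^{-1}B = I$, we have the identities

$$B^{\cdot 1} B_{1\cdot} + B^{\cdot 2} B_{2\cdot} = I, \quad B_{1\cdot} B^{\cdot 1} = I_{11}, \quad B_{1\cdot} B^{\cdot 2} = O_{12}, \quad B_{2\cdot} B^{\cdot 1} = O_{21}, \quad \text{and} \quad B_{2\cdot} B^{\cdot 2} = I_{22}. \qquad (3.9)$$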

We  will  use  the  above  notation  and  identities  in  the  following  lemma.  The  various  parts 
of  this  lemma  are  either  well  known,  or  else  follow  directly  from  well  known  results  in 
texts  such  as  Campbell  and  Meyer  (1979). 

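Lemma 3.1.1 Let $K$ be an $m \times m$ square matrix of index $\iota > 0$ with Jordan form (3.7).
Let $V = B^{\cdot 1} B_{1\cdot}$. Then

a) $V$ and $I - V$ are projections onto $\mathcal{R}(V)$ and $\mathcal{R}(I - V)$, respectively,

b) $V(I - V) = (I - V)V = 0$,

c) $x = Vx + (I - V)x$ is the unique decomposition $x = x_1 + x_2$ for which $x_1 \in \mathcal{R}(V)$
and $x_2 \in \mathcal{R}(I - V)$, and

d)

$$\mathcal{R}(V) = \mathcal{N}(I - V) = \mathcal{R}(B^{\cdot 1}) = \mathcal{N}(B_{2\cdot}) = \mathcal{R}(K^\iota), \text{ and} \qquad (3.10)$$

$$\mathcal{R}(I - V) = \mathcal{N}(V) = \mathcal{N}(B_{1\cdot}) = \mathcal{R}(B^{\cdot 2}) = \mathcal{N}(K^\iota). \qquad (3.11)$$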
Proof: (a) Since

$$V^2 = B^{\cdot 1}(B_{1\cdot} B^{\cdot 1})B_{1\cdot} = B^{\cdot 1} I_{11} B_{1\cdot} = V \qquad (3.12)$$

and

$$(I - V)^2 = I + V^2 - 2V = I + V - 2V = I - V, \qquad (3.13)$$

both $V$ and $I - V$ are idempotent, and hence projection matrices. We say that $V$ projects
onto $\mathcal{R}(V)$ along $\mathcal{N}(V)$, and that $I - V$ projects onto $\mathcal{N}(V)$ along $\mathcal{R}(V)$. Note that these
projections are in general not orthogonal.

(b) This follows immediately from (a):

$$V(I - V) = V - V^2 = V - V = 0; \qquad (3.14)$$

similarly $(I - V)V = 0$.

(c) Of course, $x = Vx + (I - V)x$ is one such decomposition. Let $x = x_1 + x_2$, where
$x_1 \in \mathcal{R}(V)$ and $x_2 \in \mathcal{R}(I - V)$. Then there exist vectors $y_1$ and $y_2$ such that $x_1 = Vy_1$
and $x_2 = (I - V)y_2$. So

$$Vx = V^2 y_1 + V(I - V)y_2 = Vy_1 = x_1, \qquad (3.15)$$

and

$$(I - V)x = (I - V)Vy_1 + (I - V)^2 y_2 = (I - V)y_2 = x_2, \qquad (3.16)$$

where we have made use of (a) and (b). Therefore, the decomposition is unique.

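(d) We note first that

$$K^\iota = \begin{bmatrix} B^{\cdot 1} & B^{\cdot 2} \end{bmatrix} \begin{bmatrix} J_{11}^\iota & O_{12} \\ O_{21} & N_{22}^\iota \end{bmatrix} \begin{bmatrix} B_{1\cdot} \\ B_{2\cdot} \end{bmatrix} = B^{\cdot 1} J_{11}^\iota B_{1\cdot}, \qquad (3.17)$$

since $N_{22}^\iota = 0$ by the definition of the index $\iota$.

If $x \in \mathcal{R}(K^\iota)$, then there exists a $y$ such that

$$x = K^\iota y = B^{\cdot 1}(J_{11}^\iota B_{1\cdot} y) = B^{\cdot 1} z, \qquad (3.18)$$

for some vector $z$, so $\mathcal{R}(K^\iota) \subseteq \mathcal{R}(B^{\cdot 1})$. Now let $x \in \mathcal{R}(B^{\cdot 1})$, so that

$$x = B^{\cdot 1} y = B^{\cdot 1}(J_{11}^\iota B_{1\cdot} B^{\cdot 1} J_{11}^{-\iota})y = B^{\cdot 1} J_{11}^\iota B_{1\cdot} z = K^\iota z, \qquad (3.19)$$

so $\mathcal{R}(B^{\cdot 1}) \subseteq \mathcal{R}(K^\iota)$, and hence $\mathcal{R}(B^{\cdot 1}) = \mathcal{R}(K^\iota)$.

Obviously,

$$\mathcal{N}(B_{1\cdot}) \subseteq \mathcal{N}(B^{\cdot 1} J_{11}^\iota B_{1\cdot}) = \mathcal{N}(K^\iota). \qquad (3.20)$$

Let $x \in \mathcal{N}(K^\iota)$. Then,

$$K^\iota x = B^{\cdot 1} J_{11}^\iota B_{1\cdot} x = 0 \implies J_{11}^{-\iota}(B_{1\cdot} B^{\cdot 1})J_{11}^\iota B_{1\cdot} x = B_{1\cdot} x = 0, \qquad (3.21)$$

so $\mathcal{N}(K^\iota) \subseteq \mathcal{N}(B_{1\cdot})$, and hence $\mathcal{N}(B_{1\cdot}) = \mathcal{N}(K^\iota)$.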

We show next that $\mathcal{R}(V) = \mathcal{N}(I - V)$ and $\mathcal{R}(I - V) = \mathcal{N}(V)$. Assume that $x \in
\mathcal{R}(I - V)$. Then, using (b),

$$x = (I - V)y \implies Vx = V(I - V)y = 0 \implies x \in \mathcal{N}(V). \qquad (3.22)$$

Conversely, if $x \in \mathcal{N}(V)$ then

$$Vx = 0 \implies x - Vx = (I - V)x = x \implies x \in \mathcal{R}(I - V). \qquad (3.23)$$

Therefore, $\mathcal{R}(I - V) = \mathcal{N}(V)$.

Similarly, assume that $x \in \mathcal{R}(V)$. Then

$$x = Vy \implies (I - V)x = (I - V)Vy = 0 \implies x \in \mathcal{N}(I - V). \qquad (3.24)$$

If $x \in \mathcal{N}(I - V)$, then

$$(I - V)x = 0 \implies x = Vx \implies x \in \mathcal{R}(V). \qquad (3.25)$$

Therefore, $\mathcal{R}(V) = \mathcal{N}(I - V)$.

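Now we show that $\mathcal{R}(V) = \mathcal{R}(B^{\cdot 1})$ and $\mathcal{N}(V) = \mathcal{N}(B_{1\cdot})$. The first of these follows
from

$$y \in \mathcal{R}(V) \implies y = Vx = B^{\cdot 1} B_{1\cdot} x = B^{\cdot 1} z \implies y \in \mathcal{R}(B^{\cdot 1}) \qquad (3.26)$$

and

$$y \in \mathcal{R}(B^{\cdot 1}) \implies y = B^{\cdot 1} x \implies y = B^{\cdot 1} I_{11} x = (B^{\cdot 1} B_{1\cdot})B^{\cdot 1} x = Vz \implies y \in \mathcal{R}(V). \qquad (3.27)$$

The identity $\mathcal{N}(V) = \mathcal{N}(B_{1\cdot})$ follows similarly from

$$x \in \mathcal{N}(V) \implies Vx = B^{\cdot 1} B_{1\cdot} x = 0 \implies (B_{1\cdot} B^{\cdot 1})B_{1\cdot} x = B_{1\cdot} x = 0 \implies x \in \mathcal{N}(B_{1\cdot}) \qquad (3.28)$$

and

$$x \in \mathcal{N}(B_{1\cdot}) \implies B_{1\cdot} x = 0 \implies B^{\cdot 1} B_{1\cdot} x = Vx = 0 \implies x \in \mathcal{N}(V). \qquad (3.29)$$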

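We complete the proof of this lemma by showing that $\mathcal{N}(B_{2\cdot}) = \mathcal{N}(I - V)$ and
$\mathcal{R}(B^{\cdot 2}) = \mathcal{R}(I - V)$. Since $I - V = B^{\cdot 2} B_{2\cdot}$, we have immediately that $\mathcal{N}(B_{2\cdot}) \subseteq \mathcal{N}(I - V)$.
If $x \in \mathcal{N}(I - V)$, then

$$(I - V)x = B^{\cdot 2} B_{2\cdot} x = 0 \implies (B_{2\cdot} B^{\cdot 2})B_{2\cdot} x = B_{2\cdot} x = 0, \qquad (3.30)$$

and so $\mathcal{N}(I - V) \subseteq \mathcal{N}(B_{2\cdot})$, and hence $\mathcal{N}(B_{2\cdot}) = \mathcal{N}(I - V)$. Finally, if $x \in \mathcal{R}(I - V)$,
then

$$x = (I - V)y = B^{\cdot 2} B_{2\cdot} y = B^{\cdot 2} z, \qquad (3.31)$$

so $\mathcal{R}(I - V) \subseteq \mathcal{R}(B^{\cdot 2})$. Conversely, if $x \in \mathcal{R}(B^{\cdot 2})$, then

$$x = B^{\cdot 2} y = B^{\cdot 2}(B_{2\cdot} B^{\cdot 2})y = (I - V)(B^{\cdot 2} y) = (I - V)x, \qquad (3.32)$$

therefore $\mathcal{R}(B^{\cdot 2}) \subseteq \mathcal{R}(I - V)$, so $\mathcal{R}(B^{\cdot 2}) = \mathcal{R}(I - V)$. ∎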
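
The objects in Lemma 3.1.1 can be written out explicitly for a small matrix. The sketch below is illustrative only: the $2 \times 2$ matrix and its Jordan factors are assumptions chosen so that everything can be verified by hand (here $s = 1$, $J_{11} = [1]$, and $N_{22} = [0]$, so the index is 1).

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

B = [[1.0, 1.0], [0.0, -1.0]]        # columns: eigenvectors for eigenvalues 1 and 0
B_inv = [[1.0, 1.0], [0.0, -1.0]]    # this particular B is its own inverse
J = [[1.0, 0.0], [0.0, 0.0]]
K = matmul(matmul(B, J), B_inv)      # K = B J B^{-1}

B_col1 = [[row[0]] for row in B]     # the m x s block: first column of B
B_row1 = [B_inv[0]]                  # the s x m block: first row of B^{-1}
V = matmul(B_col1, B_row1)
I_minus_V = [[(1.0 if i == j else 0.0) - V[i][j] for j in range(2)] for i in range(2)]

x = [3.0, 2.0]
x1 = matvec(V, x)                    # component in the range of V
x2 = matvec(I_minus_V, x)            # component in the nullspace of V
print(V, x1, x2)
```

Here $V$ is idempotent but not symmetric, so the projection is oblique, and the component `x2` is annihilated by $K$, in agreement with part (d) of the lemma.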

Convergence  of  Richardson's  Algorithm  for  Nonsingular  Matrix  Equations

Convergence of Richardson’s algorithm (3.3) to the unique solution $f = K^{-1}g$ of the
equation $Kf = g$ where $K$ is nonsingular depends on the spectral radius of the iteration
matrix

$$G = I - \theta K. \qquad (3.33)$$

To see this, let $Kf = g$ and note that (3.3) leads to

$$(f - f^n) = (f - f^{n-1}) - \theta K(f - f^{n-1}) = G(f - f^{n-1}), \qquad (3.34)$$

and hence, if we let

$$u^n = f - f^n, \qquad (3.35)$$

then

$$u^n = G^n u^0. \qquad (3.36)$$

If $\rho(G) < 1$, then by Theorem 2.2.2, $G^n \to 0$, so $u^n \to 0$ for all initial approximations $f^0$
and all right hand sides $g$. If $\rho(G) \ge 1$, then $G^n \not\to 0$. So, by the definition of a convergent
matrix, there must exist vectors $u^0$ for which $u^n = G^n u^0 \not\to 0$, hence there exist initial
vectors $f^0$ such that $f^n \not\to f$. We have established the following theorem:

Theorem 3.1.1 (Convergence for The Nonsingular Case) Let $f = K^{-1}g$. A necessary
and sufficient condition for the iteration (3.3) to converge to $f$ for all $f^0$ is that
$\rho(I - \theta K) < 1$.
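
Theorem 3.1.1 does not require $K$ to be symmetric. The sketch below illustrates the condition for a hypothetical nonsymmetric $2 \times 2$ matrix with eigenvalues $1 \pm i$, for which $\rho(I - \theta K) < 1$ exactly when $0 < \theta < 1$. The spectral radius helper is written for the $2 \times 2$ case only and, like the example matrix, is an assumption made for illustration.

```python
import cmath

def spectral_radius_2x2(m):
    """Largest eigenvalue modulus of a 2 x 2 matrix, from its characteristic polynomial."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))

def iteration_matrix(k, theta):
    """G = I - theta K, as in (3.33)."""
    return [[(1.0 if i == j else 0.0) - theta * k[i][j] for j in range(2)] for i in range(2)]

def richardson(k, g, theta, n_iter):
    """Richardson's iteration (3.3) started from f = 0."""
    f = [0.0, 0.0]
    for _ in range(n_iter):
        kf = [sum(kij * fj for kij, fj in zip(row, f)) for row in k]
        f = [fi + theta * (gi - ki) for fi, gi, ki in zip(f, g, kf)]
    return f

K = [[1.0, -1.0], [1.0, 1.0]]    # nonsingular, eigenvalues 1 + i and 1 - i
g = [0.0, 2.0]                   # K f = g has the solution f = (1, 1)

rho_ok = spectral_radius_2x2(iteration_matrix(K, 0.5))
rho_bad = spectral_radius_2x2(iteration_matrix(K, 1.2))
f_ok = richardson(K, g, 0.5, 200)
f_bad = richardson(K, g, 1.2, 200)
print(rho_ok, f_ok)
print(rho_bad, f_bad)
```

With $\theta = 0.5$ the spectral radius is below one and the iterates approach $(1, 1)$; with $\theta = 1.2$ it exceeds one and the iterates grow without bound.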

If $K$ is positive definite then, because of the spectral theorem (Theorem 2.1.1), the
behavior of Richardson’s algorithm is particularly transparent. Let $K$ be $m \times m$ and
positive definite, with eigenvalues

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m > 0, \qquad (3.37)$$

and corresponding orthonormal eigenvectors $v_1, \ldots, v_m$. The condition $\rho(I - \theta K) < 1$
translates to

$$-1 < 1 - \theta\lambda_1 \le \cdots \le 1 - \theta\lambda_m < 1, \qquad (3.38)$$

and we have

$$0 < \theta < 2/\lambda_1 \qquad (3.39)$$

as the necessary and sufficient condition for convergence for arbitrary $f^0$. The solution $f$,
the right hand side $g$, and the iterates $f^n$ can be expressed in terms of these eigenvectors
as, say,

$$f = \sum_{i=1}^m c_i v_i, \qquad (3.40)$$

$$g = \sum_{i=1}^m b_i v_i, \text{ and} \qquad (3.41)$$

$$f^n = \sum_{i=1}^m c_i^n v_i. \qquad (3.42)$$

Because of the orthonormality of the $\{v_i\}$, the Richardson iteration (3.3), in the form
(3.34), leads to the following expression for the coefficients $\{c_i^n\}$:

$$c_i - c_i^n = (1 - \theta\lambda_i)^n (c_i - c_i^0) \qquad (3.43)$$

for $i = 1, \ldots, m$ and $n \ge 0$. If the condition (3.39) holds, then, for each $i$,

$$\lim_{n \to \infty} c_i^n = c_i, \qquad (3.44)$$

so $f^n \to f$. Note that since $Kf = g$,

$$Kf = \sum_{i=1}^m \sum_{j=1}^m \lambda_i v_i v_i^T c_j v_j = \sum_{i=1}^m \lambda_i c_i v_i \qquad (3.45)$$

implies that

$$b_i = \lambda_i c_i \qquad (3.46)$$

for $i = 1, \ldots, m$.
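
The coefficient formula (3.43) can be compared directly with the iteration when $K$ is diagonal, since the eigenvectors are then the unit coordinate vectors and the coefficients are simply the components. The eigenvalues, solution, and step sizes below are illustrative assumptions.

```python
def richardson_diag(lams, g, theta, f0, n_iter):
    """Richardson's iteration (3.3) for K = diag(lams)."""
    f = list(f0)
    for _ in range(n_iter):
        f = [fi + theta * (gi - lam * fi) for fi, gi, lam in zip(f, g, lams)]
    return f

lams = [4.0, 1.0]                                # eigenvalues, largest first
c = [1.0, -2.0]                                  # coefficients of the solution f
g = [lam * ci for lam, ci in zip(lams, c)]       # right hand side g = K f
c0 = [0.0, 0.0]                                  # coefficients of the starting vector
n = 10

theta = 0.4                                      # inside (0, 2 / 4) = (0, 0.5)
fn = richardson_diag(lams, g, theta, c0, n)
# Closed form obtained by solving (3.43) for the n-th coefficient.
predicted = [ci - (1.0 - theta * lam) ** n * (ci - c0i)
             for ci, c0i, lam in zip(c, c0, lams)]

theta_bad = 0.6                                  # outside the interval (3.39)
fn_bad = richardson_diag(lams, g, theta_bad, c0, 20)
print(fn, predicted, fn_bad)
```

With the larger step size the component belonging to the largest eigenvalue is multiplied by $1 - 0.6 \cdot 4 = -1.4$ at every step and diverges, while the other component still converges.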
Convergence  of  Richardson’s  Algorithm  for  Singular  Matrix  Equations

We come now to the central results of this section. What if the square matrix $K$ in the
equation $Kf = g$ is singular, so that this equation has either zero or else infinitely many
solutions? We consider in this subsection conditions under which Richardson’s algorithm
applied to a singular matrix equation converges to a solution; a necessary condition for
this is that $g \in \mathcal{R}(K)$. The geometry underlying Lemma 3.1.1 leads directly to the
following sufficient conditions for convergence:

Theorem 3.1.2 (Convergence for The Singular Case) Let $K$ be a square, singular
matrix with index $\iota$. Richardson’s algorithm (3.3) converges to a solution $f$ of the equation
$Kf = g$ if the following conditions are satisfied:

1. All of the nonzero eigenvalues of the matrix $\theta K$ are contained in the interior of the
unit circle with center $(1, 0)$ in the complex plane,

2. $f^0 \in \mathcal{R}(K^{\iota - 1})$, and

3. $g \in \mathcal{R}(K^\iota)$,

where we interpret $\mathcal{R}(K^0)$ to mean $\mathcal{R}(I)$.

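Proof: Using the notation of Section 3.1.2, note that the index $\iota \ge 1$, and write (3.3) in
the form

$$\begin{aligned}
f^n &= (I - \theta K)^n f^0 + \sum_{i=0}^{n-1} (I - \theta K)^i \theta g \qquad (3.47) \\
&= \begin{bmatrix} B^{\cdot 1} & B^{\cdot 2} \end{bmatrix} \begin{bmatrix} (I_{11} - \theta J_{11})^n & O_{12} \\ O_{21} & (I_{22} - \theta N_{22})^n \end{bmatrix} \begin{bmatrix} B_{1\cdot} \\ B_{2\cdot} \end{bmatrix} f^0 \\
&\quad + \begin{bmatrix} B^{\cdot 1} & B^{\cdot 2} \end{bmatrix} \begin{bmatrix} \sum_{i=0}^{n-1} (I_{11} - \theta J_{11})^i & O_{12} \\ O_{21} & \sum_{i=0}^{n-1} (I_{22} - \theta N_{22})^i \end{bmatrix} \begin{bmatrix} B_{1\cdot} \\ B_{2\cdot} \end{bmatrix} \theta g \\
&= B^{\cdot 1}(I_{11} - \theta J_{11})^n B_{1\cdot} f^0 + B^{\cdot 1} \sum_{i=0}^{n-1} (I_{11} - \theta J_{11})^i B_{1\cdot}\,\theta g \\
&\quad + B^{\cdot 2}(I_{22} - \theta N_{22})^n B_{2\cdot} f^0 + B^{\cdot 2} \sum_{i=0}^{n-1} (I_{22} - \theta N_{22})^i B_{2\cdot}\,\theta g \\
&= a_1^n + a_2^n + a_3^n + a_4^n.
\end{aligned}$$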

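The nonzero eigenvalues of $\theta K$ correspond to the eigenvalues of the nonsingular matrix
$\theta J_{11}$. If $\lambda$ is an eigenvalue of $K$ for which $\theta\lambda$ lies in the interior of the circle specified in the statement
of the theorem, then $1 - \theta\lambda$ is contained in the interior of the unit circle centered at the
origin, so condition 1) implies that

$$\rho(I_{11} - \theta J_{11}) < 1. \qquad (3.48)$$

Hence, by Theorem 2.2.2,

$$\lim_{n \to \infty} a_1^n = \lim_{n \to \infty} B^{\cdot 1}(I_{11} - \theta J_{11})^n B_{1\cdot} f^0 = 0 \qquad (3.49)$$

and by Lemma 2.2.3

$$\lim_{n \to \infty} a_2^n = \lim_{n \to \infty} B^{\cdot 1} \sum_{i=0}^{n-1} (I_{11} - \theta J_{11})^i B_{1\cdot}\,\theta g = B^{\cdot 1} J_{11}^{-1} B_{1\cdot} g. \qquad (3.50)$$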
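Next, assume that $f^0 \in \mathcal{R}(K^{\iota - 1})$. Then there exists a vector $y$ such that

$$f^0 = B^{\cdot 1} J_{11}^{\iota - 1} B_{1\cdot} y + B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y, \qquad (3.51)$$

and hence, for all $n \ge 0$,

$$\begin{aligned}
a_3^n &= B^{\cdot 2}(I_{22} - \theta N_{22})^n B_{2\cdot} f^0 \qquad (3.52) \\
&= B^{\cdot 2}(I_{22} - \theta N_{22})^n B_{2\cdot} \left( B^{\cdot 1} J_{11}^{\iota - 1} B_{1\cdot} y + B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y \right) \\
&= \sum_{i=0}^{n} \binom{n}{i} B^{\cdot 2}(-\theta N_{22})^i B_{2\cdot} \left( B^{\cdot 1} J_{11}^{\iota - 1} B_{1\cdot} y + B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y \right) \\
&= \sum_{i=0}^{n} \binom{n}{i} B^{\cdot 2}(-\theta)^i N_{22}^{i + \iota - 1} B_{2\cdot} y \\
&= B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y,
\end{aligned}$$

where we have used the identities (3.9) and the definition of the index $\iota$. Note that $a_3^n$ is
independent of $n$. The remaining condition and Lemma 3.1.1 (d) together lead to

$$g \in \mathcal{R}(K^\iota) = \mathcal{N}(B_{2\cdot}), \qquad (3.53)$$

therefore the term $a_4^n$ of (3.47) is equal to zero for all $n$.

If the conditions of the theorem are satisfied, then

$$\lim_{n \to \infty} f^n = \lim_{n \to \infty} (a_1^n + a_2^n + a_3^n + a_4^n) = B^{\cdot 1} J_{11}^{-1} B_{1\cdot} g + B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y, \qquad (3.54)$$

for some vector $y$. We demonstrate that the algorithm leads to convergence to a solution
by evaluating $K(\lim_{n \to \infty} f^n)$:

$$\begin{aligned}
K \left( B^{\cdot 1} J_{11}^{-1} B_{1\cdot} g + B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y \right) &= \left( B^{\cdot 1} J_{11} B_{1\cdot} + B^{\cdot 2} N_{22} B_{2\cdot} \right) \left( B^{\cdot 1} J_{11}^{-1} B_{1\cdot} g + B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y \right) \qquad (3.55) \\
&= B^{\cdot 1} J_{11} B_{1\cdot} B^{\cdot 1} J_{11}^{-1} B_{1\cdot} g + B^{\cdot 2} N_{22} B_{2\cdot} B^{\cdot 2} N_{22}^{\iota - 1} B_{2\cdot} y \\
&= B^{\cdot 1} J_{11} J_{11}^{-1} B_{1\cdot} g + B^{\cdot 2} N_{22}^{\iota} B_{2\cdot} y \\
&= B^{\cdot 1} B_{1\cdot} g = Vg = g,
\end{aligned}$$

where we have used the identities (3.9) and Lemma 3.1.1. Note that the term involving
$y$ of (3.54) depends on $f^0$, and will not be unique since $\iota > 0$. ∎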
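
The role of conditions 2) and 3) can be seen in two very small examples, one of index 1 and one of index 2. The matrices, right hand sides, and starting vectors in this sketch are hypothetical and were chosen so that each iterate can be followed by hand.

```python
def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def richardson(k, g, theta, f0, n_iter):
    """Richardson's iteration (3.3) from the starting vector f0."""
    f = list(f0)
    for _ in range(n_iter):
        f = [fi + theta * (gi - ki) for fi, gi, ki in zip(f, g, matvec(k, f))]
    return f

# Index 1: N_22 = 0, so condition 2) allows every starting vector.
K1 = [[1.0, 0.0], [0.0, 0.0]]
f_a = richardson(K1, [2.0, 0.0], 0.5, [0.0, 7.0], 60)    # g is in the range of K1

# Index 2: K2 is nilpotent; R(K2) is spanned by (1, 0) and R(K2^2) = {0}.
K2 = [[0.0, 1.0], [0.0, 0.0]]
f_b = richardson(K2, [0.0, 0.0], 0.5, [3.0, 0.0], 10)    # f0 in R(K2): conditions hold
f_c = richardson(K2, [0.0, 0.0], 0.5, [0.0, 1.0], 10)    # f0 not in R(K2): condition 2) fails
print(f_a, f_b, f_c)
```

In the first case the limit is a solution whose nullspace component is inherited from the starting vector, which shows why the limit is not unique. In the index 2 case a starting vector in $\mathcal{R}(K)$ is already a solution of $Kf = 0$, whereas a starting vector outside $\mathcal{R}(K)$ produces iterates that drift linearly in $n$.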

Corollary 3.1.1 (Global Convergence) If $K$ is a singular matrix with a diagonalizable
nullspace, then conditions 1) and 3) of Theorem 3.1.2 are necessary and sufficient
for Richardson’s algorithm (3.3) to converge to a solution for all initial vectors $f^0$ (which
is now the content of condition 2)).

Proof: We will use the notation of Theorem 3.1.2. If $K$ has a diagonalizable nullspace,
then $\iota = 1$. Conditions 3) and 2) of Theorem 3.1.2 become

$$g \in \mathcal{R}(K^\iota) = \mathcal{R}(K) \qquad (3.56)$$

and

$$f^0 \in \mathcal{R}(K^{\iota - 1}) = \mathcal{R}(I). \qquad (3.57)$$

Sufficiency thus follows from Theorem 3.1.2 with $\iota = 1$.

To  prove  necessity,  first  note  that  if  condition  3)  is  violated,  then  a  solution  does 
not  exist,  so  Richardson’s  algorithm  cannot  converge  to  a  solution.  Next,  assume  that 
condition  3)  holds,  but  that  condition  1)  does  not.  Then  g  €  7£(A'),  so  a  solution  exists. 
We  will  show  that  for  any  g  there  exist  vectors  f°  for  which  Richardson’s  iteration  does 
not  converge. 

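From Lemma 3.1.1 (d), observe that

$$\mathcal{R}(K) = \mathcal{N}(B_{2\cdot}), \tag{3.58}$$

hence $B_{2\cdot}\,g = 0$. Using (3.58), together with the hypothesis $\iota = 1$, the expression for the
$n$th iterate (3.47) becomes

$$\begin{aligned}
f^n &= B^{1}(I_{11} - \theta J_{11})^n B_{1\cdot}\,f^0 + B^{1}\sum_{i=0}^{n-1}(I_{11} - \theta J_{11})^i B_{1\cdot}\,\theta g + n\theta B^{2}B_{2\cdot}\,f^0 \\
&= a_1^n + a_2^n + n\theta(I - V)f^0.
\end{aligned} \tag{3.59}$$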

Since condition 1) does not hold,

$$\rho(I_{11} - \theta J_{11}) \geq 1, \tag{3.60}$$

and, by Theorem 2.2.2, $(I_{11} - \theta J_{11})^n \not\to 0$. Hence, there exist vectors $f^0$ such that $a_1^n \not\to 0$,
as well as vectors $f^0$ (for example, $f^0 = 0$) for which $a_1^n$ does converge to zero.

For any $g$, either $a_2^n$ converges, or it does not. Assume that $a_2^n$ converges. Then choose
any $f^0 \in \mathcal{R}(K)$ for which $a_1^n$ does not converge. To see that such an $f^0$ must exist, begin
by choosing any vector $f^0$ for which $a_1^n$ does not converge. Such an $f^0$ cannot be in $\mathcal{N}(B_{1\cdot})$,
which equals $\mathcal{N}(K)$ by Lemma 3.1.1 (d). Lemma 3.1.1 (c) implies that

$$f^0 = Vf^0 + (I - V)f^0 = f^0_R + f^0_N, \tag{3.61}$$

and this decomposition is unique. Since

$$f^0 \notin \mathcal{N}(K) = \mathcal{R}(I - V), \tag{3.62}$$

$f^0_R \neq 0$. Let

$$f^0 = f^0_R \in \mathcal{R}(K). \tag{3.63}$$

But $(I - V)$ projects onto $\mathcal{N}(K)$, hence $\theta n(I - V)f^0 = 0$ and

$$f^n = a_1^n + a_2^n, \tag{3.64}$$

where $a_1^n$ does not converge, but $a_2^n$ does. Therefore, $f^n$ does not converge.

Assume that $a_2^n$ does not converge. Then let $f^0 = 0$, so that

$$f^n = a_2^n, \tag{3.65}$$

which diverges. Conditions 1) and 3) are therefore necessary, and the proof of the corollary
is complete. $\blacksquare$
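
As a small numerical illustration of Corollary 3.1.1 (a sketch only, and not part of the analysis above), the following Python/NumPy fragment applies Richardson's iteration (3.3) to a singular symmetric matrix, which has a diagonalizable nullspace. The particular matrix, the right hand side, the starting vectors, and the helper name `richardson` are all assumptions made for the illustration.

```python
import numpy as np

def richardson(K, g, f0, theta, n_iter):
    """Richardson's iteration f^{n+1} = f^n + theta * (g - K f^n)."""
    f = f0.astype(float).copy()
    for _ in range(n_iter):
        f = f + theta * (g - K @ f)
    return f

# Hypothetical example: symmetric singular K with eigenvalues 1, 0.5, 0,
# so that the nullspace is diagonalizable (index 1).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
K = Q @ np.diag([1.0, 0.5, 0.0]) @ Q.T
g = K @ np.array([1.0, -2.0, 3.0])        # consistent right hand side, g in R(K)

theta = 1.0                                # eigenvalues of I11 - theta*J11 are 0 and 0.5
f_a = richardson(K, g, np.zeros(3), theta, 200)
f_b = richardson(K, g, np.array([5.0, 5.0, 5.0]), theta, 200)

residual_a = np.linalg.norm(g - K @ f_a)
residual_b = np.linalg.norm(g - K @ f_b)
difference_in_nullspace = np.linalg.norm(K @ (f_a - f_b))
```

In this sketch both starting vectors lead to a vanishing residual, while the two limits are allowed to differ by a component in the nullspace of $K$.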
Inconsistent Equations

An  inconsistent  equation  has  no  solution.  However,  if  the  right  hand  side  of  such  an 
equation is replaced with any projection onto $\mathcal{R}(K)$, then the equation which results will
have  many  solutions.  Usually,  one  considers  orthogonal  projections,  but  we  will  find  the 
generally  non-orthogonal  projection  provided  by  the  matrix  V  of  Lemma  3.1.1  to  be  more 
convenient  for  our  purposes. 

Let $K$ be an $m \times m$ singular matrix of index $\iota$. Let $g \in \mathbb{C}^m$ be an arbitrary vector.
From Lemma 3.1.1 (a-c)

$$\mathbb{C}^m = \mathcal{R}(K^{\iota}) \oplus \mathcal{N}(K^{\iota}), \tag{3.66}$$

so we can express $g$ as

$$g = g_R + g_N, \tag{3.67}$$

where $g_R \in \mathcal{R}(K^{\iota})$ and $g_N \in \mathcal{N}(K^{\iota})$. From Lemma 3.1.1, we note that $g_R$ and $g_N$ are
uniquely determined by

$$g_R = Vg \tag{3.68}$$

and

$$g_N = (I - V)g, \tag{3.69}$$

respectively, where $V = B^{1}B_{1\cdot}$. We call any vector $f$ such that $Kf = Vg$ a generalized
solution, and we write

A7=V  (3.70) 

If $K$ is a singular matrix of index $\iota$, and if $Kf = g$ is inconsistent, then since

$$g \notin \mathcal{R}(K) \;\Rightarrow\; g \notin \mathcal{R}(K^{\iota}), \tag{3.71}$$

we have as a consequence of the proof of Theorem 3.1.2 that $\{f^n\}$ is not expected to
converge. Using the notation of this theorem, we will examine the rate of divergence of
the sequence $\{f^n\}$ for the case where $g \notin \mathcal{R}(K)$ in order to develop some understanding
of how useful Richardson's algorithm can be if $g_N$ is small, but nonzero.

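From equation (3.47), we have that

$$f^n = a_1^n + a_2^n + a_3^n + a_4^n. \tag{3.72}$$

Assume that $\rho(I_{11} - \theta J_{11}) < 1$. Then

$$\lim_{n\to\infty} a_1^n = 0, \tag{3.73}$$

and

$$\lim_{n\to\infty} a_2^n = B^{1}J_{11}^{-1}B_{1\cdot}\,g = f, \tag{3.74}$$

where $f$ is a generalized solution in the sense of (3.70), since from (3.55) we see that

$$\begin{aligned}
Kf &= \left(B^{1}J_{11}B_{1\cdot} + B^{2}N_{22}B_{2\cdot}\right)B^{1}J_{11}^{-1}B_{1\cdot}\,g \\
&= B^{1}J_{11}B_{1\cdot}B^{1}J_{11}^{-1}B_{1\cdot}\,g = Vg = g_R.
\end{aligned} \tag{3.75}$$

The sequence $a_1^n$ converges to zero at the rate $\rho[(I_{11} - \theta J_{11})^n]$.

The remaining terms $a_3^n$ and $a_4^n$ do not, in general, have finite limits. If we choose
$f^0 \in \mathcal{R}(K^{\iota-1})$, which we can always do by taking $f^0 = 0$, then $a_3^n$ is equal to zero for all
$n$. However, $a_4^n$ can increase without bound if $g_N \neq 0$.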
The maximum rates at which the terms $a_3^n$ and $a_4^n$ can go to infinity follow from the
following results:

$$(I_{22} - \theta N_{22})^n = \sum_{j=0}^{\min(\iota-1,\,n)} \binom{n}{j}(-\theta N_{22})^j = O(n^{\iota-1}) \tag{3.76}$$

and

$$\sum_{i=0}^{n-1}(I_{22} - \theta N_{22})^i = \sum_{j=0}^{\min(\iota-1,\,n-1)} \binom{n}{j+1}(-\theta N_{22})^j = O(n^{\iota}), \tag{3.77}$$

where the order symbol $O(\cdot)$ is to be interpreted for each element of a matrix. Hence if
$f^0 \in \mathcal{R}(K^{\iota-1})$ we see that

$$|g - Kf^n| = |g_N + (g_R - Kf^n)| \leq |g_N| + |g_R - Kf^n| \sim O(n^{\iota})\,|g_N|. \tag{3.78}$$

If the iteration is terminated early, $g_N$ is sufficiently small, and $\iota$ is not too large, then
it will often be the case that $f^n$ will be a near-solution when the iteration is terminated,
even though, ultimately, $f^n \to \infty$. A similar argument can be made for the case where
$f^0$ has a small component not in $\mathcal{R}(K^{\iota-1})$. In other words, Richardson's algorithm is
somewhat robust to violation of the requirements that $g \in \mathcal{R}(K^{\iota})$ and $f^0 \in \mathcal{R}(K^{\iota-1})$.
This is reassuring, since for discretizations of integral equations of the first kind $g_N$ is
likely to be small, but nonzero.
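
The finite binomial expansion (3.76) can be checked numerically. The sketch below is illustrative only: the index $\iota = 3$, the value $\theta = 0.5$, the shift matrix used for $N_{22}$, and the helper name `power_by_binomial` are assumptions chosen for the example.

```python
import numpy as np
from math import comb

iota = 3                                   # assumed index of the nilpotent block
theta = 0.5
N22 = np.diag(np.ones(iota - 1), k=1)      # nilpotent shift matrix: N22**iota = 0
I22 = np.eye(iota)

def power_by_binomial(n):
    """Right hand side of (3.76): finite binomial sum in powers of -theta*N22."""
    total = np.zeros_like(I22)
    for j in range(min(iota - 1, n) + 1):
        total += comb(n, j) * np.linalg.matrix_power(-theta * N22, j)
    return total

n = 40
direct = np.linalg.matrix_power(I22 - theta * N22, n)
largest_entry = np.abs(direct).max()       # equals comb(n, iota-1) * theta**(iota-1) here
```

For this choice the largest element of the $n$th power is the coefficient of $N_{22}^{\iota-1}$, which grows like $n^{\iota-1}$, in agreement with the order statement in (3.76).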

On the other hand, if $\rho(I_{11} - \theta J_{11}) = \rho_0 > 1$ then $a_1^n + a_2^n$ will diverge at the exponential
rate $\rho_0^n$. The situation where $\rho_0 = 1$ is complicated, since whether $a_1^n$ converges to zero,
and whether $a_2^n$ converges at all, depends on the particular value of $f^0$ and $g$, respectively.
For a given vector $z$, $(I_{11} - \theta J_{11})^n z$ can converge to zero, converge to a nonzero vector, or
else not converge at all, depending on the choice of $z$ and the subspace of $\mathcal{R}(I_{11} - \theta J_{11})$
which has eigenvalues with moduli greater than or equal to one. It is difficult to make a
general statement about the $\rho_0 = 1$ case, but this situation is not likely to be important
in numerical practice.

Conditions on $\theta$ for Which $\rho(I - \theta K) < 1$

The condition $\rho(I_{11} - \theta J_{11}) < 1$ involves both the nonzero eigenvalues of $K$ and the
constant $\theta$. We will assume, without loss of generality, that $\theta > 0$. Let the nonzero
eigenvalues of $K$ be denoted $\{\lambda_i\}_{i=1}^{s}$, and let $\mu_i = 1 - \theta\lambda_i$. Conditions on $\{\lambda_i\}_{i=1}^{s}$ and $\theta$
which lead to $\max_i |\mu_i| < 1$ are given in the following lemma:

Lemma 3.1.2 Let $\{\lambda_i\}_{i=1}^{s}$ be a set of complex numbers, and let $\mu_i = 1 - \theta\lambda_i$, for $i =
1, \ldots, s$. The following conditions together imply that

$$\max_i |\mu_i| < 1: \tag{3.79}$$

1. For $i = 1, \ldots, s$, $\Re\lambda_i > 0$, and

2.
$$0 < \theta < \min_i \frac{2\,\Re\lambda_i}{|\lambda_i|^2}. \tag{3.80}$$

Proof: Assume $\Re\lambda_i > 0$ for each $i$. The condition that the $\mu_i$ be in the interior
of the unit circle is equivalent to

$$|\mu_i|^2 < 1 \iff [1 - \theta(\Re\lambda_i)]^2 + \theta^2(\Im\lambda_i)^2 < 1,$$

or

$$0 < \theta < \frac{2\,\Re\lambda_i}{|\lambda_i|^2}. \tag{3.81}$$

Since (3.81) must hold for all $i$, we have the condition (3.80). $\blacksquare$
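
The bound (3.80) is easy to evaluate for a given set of eigenvalues. The following sketch uses a hypothetical set of eigenvalues and a hypothetical helper name `theta_upper_bound`; it compares a value of $\theta$ just inside the bound with one just outside it.

```python
import numpy as np

def theta_upper_bound(eigs):
    """Bound (3.80): min over i of 2*Re(lambda_i) / |lambda_i|**2."""
    eigs = np.asarray(eigs, dtype=complex)
    if np.any(eigs.real <= 0):
        raise ValueError("condition 1 of the lemma fails: need Re(lambda_i) > 0")
    return float(np.min(2.0 * eigs.real / np.abs(eigs) ** 2))

# Hypothetical nonzero eigenvalues with positive real parts.
lams = np.array([2.0 + 1.0j, 2.0 - 1.0j, 0.1, 3.0])
bound = theta_upper_bound(lams)

inside = np.max(np.abs(1.0 - 0.99 * bound * lams))    # theta just below the bound
outside = np.max(np.abs(1.0 - 1.01 * bound * lams))   # theta just above the bound
```

Here the eigenvalue $\lambda = 3$ determines the bound, $\theta < 2/3$, so that the largest eigenvalue in modulus, and not the smallest, limits the admissible step.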

The Nullspace of $G = I - \theta K$ When $\theta K > 0$ and $\rho(\theta K) = 1$

We consider next the special case of Richardson's algorithm (3.3) applied to a matrix
equation $Kf = g$ for which $K$ is positive. A positive matrix is a matrix for which all of
the elements are positive, and we write $K > 0$. From Lemma 3.1.2 and Theorem 3.1.2, it
is clear that it is very desirable for the nonzero eigenvalues of $K$ to have positive real parts,
since if this is not the case, then there exists no $\theta$ for which the sufficient conditions of
Theorem 3.1.2 and the necessary and sufficient conditions of Theorem 3.1.1 and Corollary
3.1.1 for Richardson's algorithm will be satisfied. So we will assume that in addition to
$K$ being positive, all of the nonzero eigenvalues of $K$ have positive real parts.

By the Perron-Frobenius theorem (Theorem 2.1.3), the largest eigenvalue of $K$ in
magnitude is positive and equal to $\rho(K)$, all other eigenvalues have modulus less than
$\rho(K)$, and the corresponding Perron-Frobenius eigenvector is a positive vector $z$. By
selecting

$$\theta = 1/\rho(K), \tag{3.82}$$

we have $\rho(\theta K) = 1$. Then $G = I - \theta K$ has one eigenvalue equal to zero and all other
eigenvalues of $G$ have positive real parts.

The matrix $\theta K^T$ is also positive, with $\rho(\theta K^T) = 1$. The Perron-Frobenius theorem
implies that there exists a positive vector $y$ which is an eigenvector of $\theta K^T$ corresponding
to the eigenvalue one. Since

$$y^T\theta K = y^T, \tag{3.83}$$

it is customary to refer to $y^T$ as a left eigenvector of $K$ and to $z$ as a right eigenvector
of $K$, both corresponding to the same eigenvalue. When there is no risk of confusion, we
will continue to refer to right eigenvectors simply as eigenvectors.

The  main  result  of  this  subsection  is  clarification  of  the  role  of  the  positive  eigenvalue 
and  corresponding  left  and  right  eigenvectors  in  Richardson’s  algorithm.  In  order  to 
establish  this  result,  we  need  to  build  on  the  geometry  of  Lemma  3.1.1,  and  we  do  this 
next  for  a  general  square  matrix.  Later  we  will  specialize  to  positive  K . 

Assume that $K$ is an $m \times m$ matrix and that the Jordan form of $K$ consists of $r \leq m$
Jordan blocks $J_{ii}$. Let

$$J = \mathrm{diag}(J_{11}, \ldots, J_{rr}), \tag{3.84}$$

where we order these blocks so that the corresponding eigenvalues,

$$|\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_r| \geq 0, \tag{3.85}$$

are in order of decreasing modulus.

Using notation similar to that of Section 3.1.2, we can represent $K$ as

$$K = B^{-1}JB, \tag{3.86}$$
where

$$B^{-1} = \left[\,B^{1}\;\; B^{2}\;\; \cdots\;\; B^{r}\,\right] \tag{3.87}$$

and

$$B = \begin{bmatrix} B_{1\cdot} \\ \vdots \\ B_{r\cdot} \end{bmatrix}. \tag{3.88}$$

Then (3.86) becomes

$$K = \sum_{i=1}^{r} B^{i}J_{ii}B_{i\cdot}. \tag{3.89}$$

Because

$$BB^{-1} = B^{-1}B = I, \tag{3.90}$$

we have the following relations among the components of the partitioned matrices (3.87)
and (3.88):

$$\sum_{i=1}^{r} B^{i}B_{i\cdot} = I, \tag{3.91}$$

$$B_{i\cdot}B^{i} = I_{ii}, \tag{3.92}$$

and, for $i \neq j$,

$$B_{i\cdot}B^{j} = 0_{ij}, \tag{3.93}$$

where the $I_{ii}$ and $0_{ij}$ are identity and zero matrices, respectively, of the appropriate
dimensions. Because of these relations, we can easily generalize Lemma 3.1.1 to consider
projections onto the $r$ subspaces corresponding to the Jordan representation (3.86). In
particular, we have the $r$ projection matrices

$$V_i = B^{i}B_{i\cdot} \tag{3.94}$$

for $i = 1, \ldots, r$, where $V_i^2 = V_i$ and, for $i \neq j$, $V_iV_j = 0$. For any vector $x \in \mathbb{C}^m$, we have

$$x = \sum_{i=1}^{r} V_i x = \sum_{i=1}^{r} x_i, \tag{3.95}$$

where for each $i$, the vector $x_i$ is the projection of $x$ onto $\mathcal{R}(V_i)$. It is important to note
that these projections are, in general, not orthogonal.

Now assume that $K$ is positive, and that all nonzero eigenvalues of $K$ have positive
real parts. Because of the Perron-Frobenius theorem, $\lambda_1 = 1/\theta$ is larger in modulus than
all other eigenvalues of $K$, and $\lambda_1$ has algebraic multiplicity one. Hence $V_1$ is a matrix of
rank one. In fact, it is not difficult to show that

$$V_1 \propto zy^T. \tag{3.96}$$

Since

$$G = I - \theta K = B^{-1}(I - \theta J)B, \tag{3.97}$$

the subspace onto which $V_1$ projects corresponds to an eigenvalue of $G$ which equals
$1 - \theta/\theta = 0$, and this eigenvalue has multiplicity one.
Next, let $f$ be a solution to the consistent matrix equation $Kf = g$, and assume that
Richardson's algorithm (3.3) converges to $f$. Let the discrepancy be

$$u^n = f - f^n, \tag{3.98}$$

and write Richardson's iteration (3.3) in the form

$$u^{n+1} = Gu^n, \tag{3.99}$$

where

$$G = I - \theta K \tag{3.100}$$

and $Gz = 0$.

We have the following decomposition of $u^0$ in terms of the subspaces $\{\mathcal{R}(V_i)\}_{i=1}^{r}$ corresponding
to $G$:

$$u^0 = \sum_{i=1}^{r} V_i u^0 = \sum_{i=1}^{r} B^{i}B_{i\cdot}\,u^0. \tag{3.101}$$

Since $V_1$ corresponds to the zero eigenvalue of $G$, we have that

$$\begin{aligned}
u^n = G^n u^0 &= \sum_{i=1}^{r} B^{i}(I - \theta J_{ii})^n B_{i\cdot}B^{i}B_{i\cdot}\,u^0 \\
&= \sum_{i=2}^{r} B^{i}(I - \theta J_{ii})^n B_{i\cdot}\,u^0.
\end{aligned} \tag{3.102}$$

Hence, for all $n > 0$,

$$V_1 u^n = zy^T u^n = 0. \tag{3.103}$$

Since $z > 0$, this implies that $y^T u^n = 0$ for all $n > 0$.

What does this tell us? A weighted sum of the components of $f - f^n$ equals zero for
each  n  >  0,  with  the  weights  corresponding  to  the  positive  left  eigenvector  of  K.  We  can 
easily  calculate  this  vector  for  a  given  problem,  and  this  might  lead  to  insight  into  how 
well  Richardson’s  iteration  can  be  expected  to  perform.  However,  to  calculate  the  left 
Perron-Frobenius  eigenvector  is  of  roughly  the  same  order  of  difficulty  as  the  iteration 
itself. 
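
The orthogonality relation $y^T u^n = 0$ can be observed numerically. The sketch below is only an illustration: the random positive matrix, the solution vector, and the starting vector are assumptions, and the check relies only on the identity $y^T G = 0$, which holds for $\theta = 1/\rho(K)$ whether or not the iteration converges.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
K = rng.uniform(0.1, 1.0, size=(m, m))        # hypothetical positive matrix

# Left Perron-Frobenius eigenvector: eigenvector of K^T for the eigenvalue rho(K).
vals, vecs = np.linalg.eig(K.T)
k = np.argmax(vals.real)
rho = vals[k].real
y = vecs[:, k].real
y = y / y.sum()                                # fix sign and scale so that y > 0

theta = 1.0 / rho                              # the choice (3.82)
f_true = rng.standard_normal(m)
g = K @ f_true                                 # consistent right hand side

f = np.zeros(m)
weighted_sums = []
for n in range(1, 6):
    f = f + theta * (g - K @ f)                # Richardson step (3.3)
    weighted_sums.append(float(y @ (f_true - f)))   # y^T u^n for n = 1, ..., 5
```

Up to rounding error, each entry of `weighted_sums` vanishes, although the individual components of $u^n$ need not be small.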

If $\theta K$ is stochastic (recall that a stochastic matrix is a nonnegative matrix for which the
elements in each row sum to one) then $\theta K$ is the transition matrix of some Markov chain,
where the left (positive) eigenvector of $\theta K$ is proportional to the stationary distribution
of this chain, and the right eigenvector is positive and constant (e.g., Horn and Johnson,
1985, 487-489). We have shown above that, for all $n > 0$,

$$E(u^n) = 0, \tag{3.104}$$

which implies that

$$E(u^{n+1} - u^n) = E(\delta^n) = 0, \tag{3.105}$$

where the expectations are with respect to the stationary distribution of the Markov chain
corresponding to $\theta K$.

If $\theta K$ is symmetric and stochastic, then both $z$ and $y$ equal a constant vector, so
the sum of the components of $f - f^n$ will be zero for all positive $n$. In many situations,
when the sum of the components of $u^n$ equals zero, we will have $\|u^n\|$ small.
3.2  Stochastic Preconditioning and the Conditional Expectation Algorithm

When an ill-posed integral equation is discretized, the matrix equation which results
will have many eigenvalues with small absolute values. Because of this, $G = I - \theta K$
will have many eigenvalues at, or near, one in the complex plane. We have seen in the
previous section that the convergence of Richardson's algorithm (3.3) in the direction of
an eigenvector corresponding to an eigenvalue $\lambda$ of $G$ for which $|\lambda| < 1$ and $|\lambda| \approx 1$ will
be slow, since the convergence rate in the direction of this eigenvector is governed by the
powers of $\lambda$.

This ultimate slow convergence is both an advantage and a disadvantage. It is advantageous
to not rapidly approach a 'solution' which, because of noise, is neither smooth nor
near any solution to the corresponding integral equation. Of course, it is also advantageous
for the iterates to not diverge rapidly if the matrix equation is inconsistent. But it
is disadvantageous to use an iteration for which the convergence becomes very slow when
the distance $\|g - Kf^n\|$ is still unacceptably large.

At the beginning of this chapter, we mentioned the notion of preconditioning so as to
accelerate convergence. The idea is to choose a nonsingular matrix $D$ so that the iteration
(3.2), repeated here for convenient reference,

$$f^{n+1} = f^n + \theta D(g - Kf^n), \quad \text{for } n = 0, 1, 2, \ldots, \tag{3.106}$$

converges rapidly, at least initially. We will restrict attention, for the most part, to nonsingular
diagonal preconditioning matrices and to square matrices $K$ with positive elements.
We will provide several motivations for choosing $D$ so that if $K$ is positive, then $DK$ is
stochastic. We will refer to this form of preconditioning as stochastic preconditioning and
to the algorithm which results, along with its nonlinear generalizations, as the Conditional
Expectation algorithm. The effectiveness of this approach will be illustrated through examples
in this and subsequent chapters. Stochastic matrices are relevant in the theory
of Markov chains and stochastic processes, so the presence of a stochastic matrix here is
a hint that a natural probabilistic interpretation of this preconditioned algorithm should
be possible.

3.2.1  A  Property  of  Positive  Definite  Preconditioning  Matrices 

Lemma 3.1.2 implies that if the nonzero eigenvalues of a matrix $K$ have positive real
parts, and if the positive constant $\theta$ is sufficiently small, then the eigenvalues of $I - \theta K$
which are not equal to one will be in the interior of the unit circle. It is therefore a
desirable property of a preconditioning matrix $D$ that if all of the eigenvalues of $K$ have
nonnegative real parts, then the eigenvalues of $DK$ have nonnegative real parts as well. We
demonstrate below that positive definite preconditioning matrices have this 'nonnegative
real part preserving' property.

Theorem 3.2.1 Let $K$ be a square matrix, and assume that all of the eigenvalues of $K$
have nonnegative real parts. Let $D$ be positive definite. Then all of the eigenvalues of $DK$
also have nonnegative real parts.

Proof: Write $K$ in the form

$$K = (K + K^*)/2 + i(K - K^*)/(2i) = K_1 + iK_2. \tag{3.107}$$

Let $\lambda$ be an arbitrary eigenvalue of $K$, and let $x$ be a corresponding normalized eigenvector.
Then

$$\lambda = x^*Kx = x^*K_1x + ix^*K_2x. \tag{3.108}$$

The matrices $K_1$ and $K_2$ are Hermitian, and so $x^*K_1x$ and $x^*K_2x$ are both real numbers.
It follows that $x^*K_1x$ and $x^*K_2x$ are equal to $\Re\lambda \geq 0$ and $\Im\lambda$, respectively.

Since $D$ is positive definite (hence, Hermitian), $D$ has a positive definite Hermitian
square root (Strang, 1976, p. 241). For $i = 1, 2$,

$$DK_i = D^{1/2}\left[D^{1/2}K_iD^{1/2}\right]D^{-1/2}, \tag{3.109}$$

and so $DK_i$ is similar to $D^{1/2}K_iD^{1/2}$, which is congruent to $K_i$. Thus $DK_i$ has the
same eigenvalues as $D^{1/2}K_iD^{1/2}$. By congruence (Lemma 2.1.1), the eigenvalues of
$D^{1/2}K_1D^{1/2}$ are nonnegative. Because $D^{1/2}K_2D^{1/2}$ is Hermitian, it has real eigenvalues.
Thus, the eigenvalues of $DK_1$ are nonnegative and those of $DK_2$ are real. It follows that
the eigenvalues of $DK$ have nonnegative real parts. $\blacksquare$
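
The similarity argument of (3.109) can be illustrated numerically in the simplest setting, where $K$ itself is real, symmetric, and positive semidefinite (so that $K = K_1$ and $K_2 = 0$) and $D$ is diagonal with positive entries. This restriction, the random test matrices, and the variable names are assumptions of the sketch, which is not a check of the theorem in full generality.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 5
A = rng.standard_normal((m, m - 1))
K = A @ A.T                                    # symmetric positive semidefinite, singular
D = np.diag(rng.uniform(0.5, 2.0, size=m))     # positive definite diagonal preconditioner

eig_DK = np.linalg.eigvals(D @ K)              # eigenvalues of the (nonsymmetric) product DK
sqrtD = np.sqrt(D)                             # elementwise sqrt is the matrix square root of a diagonal D
eig_sym = np.linalg.eigvalsh(sqrtD @ K @ sqrtD)   # eigenvalues of D^{1/2} K D^{1/2}

min_real_part = eig_DK.real.min()
```

In this setting the two spectra agree, and the smallest real part is zero up to rounding error.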


3.2.2  Positive,  Bounded  Kernels  and  Stochastic  Matrices 
Consider the integral equation of the first kind

$$\int_0^1 k(x,y)f(y)\,dy = g(x), \tag{3.110}$$

where we assume that the kernel, $k(x,y)$, is positive and bounded. We discretize this
equation as discussed in Appendix A. This gives a matrix equation $Kf = g$. Let $D$ be
the diagonal matrix corresponding to stochastic preconditioning, that is, assume that

$$\tilde{K} = DK \tag{3.111}$$

is stochastic.

If the integral, in $y$, of the kernel of (3.110) were equal to one for each $x$, then this integral
equation, once discretized, would lead to a matrix equation having a nearly stochastic
matrix. The reason why the matrix might not be exactly stochastic is that the row sums
for the discretized problem are numerical approximations to integrals. We transform
(3.110) into a new equation, having the same solution, as follows:

$$\int_0^1 \tilde{k}(x,y)f(y)\,dy = \tilde{g}(x), \tag{3.112}$$

where

$$\tilde{k}(x,y) = \frac{k(x,y)}{\int_0^1 k(x,y)\,dy} \tag{3.113}$$

and

$$\tilde{g}(x) = \frac{g(x)}{\int_0^1 k(x,y)\,dy}. \tag{3.114}$$

There are two slightly different approaches to applying Richardson's algorithm (3.106)
with stochastic preconditioning. One way is to normalize the equation as in (3.112), and
to discretize this transformed equation. Richardson's algorithm, (3.3), could then be
applied, with $\theta = 1$. The other approach is to discretize (3.110), and then find the matrix
$D$ satisfying (3.111). The preconditioned Richardson algorithm, (3.106), could then be
applied, for this choice of $D$ and with $\theta = 1$. The second approach is the more desirable
one for two reasons: the normalized kernel (3.113) need only be determined numerically,
and the matrix of the resulting discretized equation is exactly stochastic.
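
The second approach can be sketched in a few lines. The Gaussian kernel, the midpoint quadrature rule (standing in for the discretization of Appendix A), the test solution, and the number of iterations below are all assumptions made for illustration.

```python
import numpy as np

m = 50
x = (np.arange(m) + 0.5) / m                        # midpoint grid on [0, 1] (assumed rule)
h = 1.0 / m
kernel = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.05 ** 2))   # hypothetical k(x, y) > 0
K = kernel * h                                      # midpoint-rule discretization of (3.110)

f_true = np.sin(np.pi * x)                          # hypothetical solution
g = K @ f_true

D = np.diag(1.0 / K.sum(axis=1))                    # diagonal D such that D K is stochastic
K_tilde = D @ K

f = np.zeros(m)
residual_norms = []
for n in range(20):
    f = f + D @ (g - K @ f)                         # iteration (3.106) with theta = 1
    residual_norms.append(float(np.linalg.norm(g - K @ f)))

row_sums = K_tilde.sum(axis=1)                      # each row sum equals one
```

Because $D$ is computed from the row sums of the discretized matrix, $DK$ is stochastic to rounding error regardless of the accuracy of the quadrature rule.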
3.2.3  Some Heuristic Motivations for Stochastic Preconditioning

We  will  make  a  case  for  stochastic  preconditioning  through  several  heuristic  motivations, 
and  we  will  illustrate  this  form  of  preconditioning  in  a  later  section,  and  in  later  chapters, 
with  several  examples.  At  this  time,  a  complete  understanding  of  why  and  when  this  form 
of  preconditioning  works  well  is  not  available.  The  heuristic  motivations  below  suggest 
directions  one  might  follow  in  order  to  attempt  to  answer  these  questions.  For  now,  the 
real  justification  for  our  choice  of  preconditioning  comes  not  from  theory,  but  from  the 
study  of  examples. 

A  Motivation  Provided  by  Condition  Numbers 

From  the  point  of  view  of  numerical  analysis,  scaling  a  positive  matrix  so  that  it  becomes  a 
stochastic  matrix  tends  to  make  the  matrix  better  conditioned.  The  following  is  a  special 
case  of  a  theorem  proved  by  Van  der  Sluis  (1969,  p.18): 

Theorem 3.2.2 Let $K$ be a nonsingular positive matrix, and let $\|\cdot\|_*$ be either the $l_2$ or
the $l_\infty$ norm. Let $D$ be a nonsingular diagonal matrix. Then the following measures of
the condition of $DK$ are minimized when the rows of $DK$ each sum to one:

1.  $\chi_1(DK) = \|DK\|_\infty\,\|(DK)^{-1}\|_*$, and

2.  X2(DK)  =  lDK\U\\DKl. 

Although $\chi_1$ and $\chi_2$ each differs from the usual condition number based on the spectral
norm, $\kappa = \|K\|_2\|K^{-1}\|_2$, all three quantities are reasonable measures of the condition of a
matrix. A preconditioning which minimizes $\chi_1$ and $\chi_2$ can be expected to usually reduce
$\kappa$ as well. In fact, if $\|\cdot\|_\infty$ is chosen for $\|\cdot\|_*$ in $\chi_1$, then $\chi_1$ becomes the condition number of
a matrix, with respect to the infinity norm. In Section 2.2.3, we showed how a condition
number relates changes in a right hand side to corresponding changes in a solution of a
matrix equation. It follows from this that the smaller a condition number is, the smaller
the change in the solution will be for a given change in right hand side, and so one would
expect Richardson's algorithm to converge more rapidly for matrices with relatively small
condition numbers.
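
The effect of row scaling on the infinity-norm condition number ($\chi_1$ with $\|\cdot\|_* = \|\cdot\|_\infty$) can be illustrated as follows. The badly row-scaled random positive matrix is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
K = rng.uniform(0.1, 1.0, size=(m, m))
# Hypothetical badly row-scaled positive matrix: rows multiplied by widely varying factors.
K = np.diag(10.0 ** rng.uniform(-3, 3, size=m)) @ K

D = np.diag(1.0 / K.sum(axis=1))                    # stochastic preconditioning
cond_before = np.linalg.cond(K, np.inf)             # ||K||_inf * ||K^{-1}||_inf
cond_after = np.linalg.cond(D @ K, np.inf)          # same measure for the stochastic matrix DK
```

Comparing `cond_after` with `cond_before` shows the reduction predicted by the theorem for this particular matrix; the size of the reduction depends on how unevenly the rows were scaled to begin with.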

A  Taylor  Series  Motivation 

Assume that (3.110) has a solution $f$, that $k(x,y)$ has a peak with location on a smooth,
monotone curve $y = v(x)$ in the unit square, where $v(0) = 0$ and $v(1) = 1$. Let $f^n$ be an
approximation to $f$ at the $n$th iteration, let $u^n = f - f^n$, and note that

$$\int_0^1 k(x,y)\left(f^n(y) + u^n(y)\right)dy = g(x), \tag{3.115}$$

where $u^n$ is now the unknown. Assume that $u^n$ has a Taylor series expansion for all $y$.
Expand $u^n$ about $v(x)$, keeping only the first term:

$$u^n[v(x)]\int_0^1 k(x,y)\,dy \approx g(x) - \int_0^1 k(x,y)f^n(y)\,dy, \tag{3.116}$$

or

$$u^n[v(x)] \approx \frac{g(x) - \int_0^1 k(x,y)f^n(y)\,dy}{\int_0^1 k(x,y)\,dy}. \tag{3.117}$$

If we let

$$f^{n+1}(x) = f^n(x) + u^n(x), \tag{3.118}$$

then, in the special case where $v(x) = x$, we have the $L_1$ version of Richardson's algorithm
with stochastic preconditioning. If $v(x)$ is not the identity, then a change of variable in $x$
reduces the problem to the special case.

A  Probabilistic  Motivation 

A simple probabilistic argument provides another motivation for stochastic preconditioning.
Since $k$ is bounded and positive, it is proportional to the joint density of two random
variables, say $X$ and $Y$. We write this as

$$\pi_{X,Y}(x,y) = c\,k(x,y), \tag{3.119}$$

where the constant $c$ is

$$c = \left(\int_0^1\!\!\int_0^1 k(x,y)\,dx\,dy\right)^{-1}. \tag{3.120}$$

The normalized kernel (3.113) is exactly the conditional density of the random variable
$Y$ given the random variable $X$:

$$\pi_{Y|X}(y|x) = \frac{\pi_{X,Y}(x,y)}{\int_0^1 \pi_{X,Y}(x,y)\,dy} = \tilde{k}(x,y). \tag{3.121}$$

Richardson's algorithm applied to (3.112) with $\theta = 1$ is

$$f^{n+1}(x) = f^n(x) + \int_0^1 \tilde{k}(x,y)\left(f(y) - f^n(y)\right)dy. \tag{3.122}$$

Since the integral on the right hand side of (3.122) can be interpreted as the conditional
expectation of the difference $f - f^n$, we can rewrite (3.122) (in terms of the random
variables $X$ and $Y$) as

$$f^{n+1}(X) - f^n(X) = E\left[f(Y) - f^n(Y)\,|\,X\right]. \tag{3.123}$$

In words: the $n$th step in this Richardson algorithm with stochastic preconditioning is
the conditional expectation of the difference between the solution and the approximation
$f^n$. Because of this, we will sometimes refer to Richardson's algorithm with stochastic
preconditioning as the Conditional Expectation algorithm.

This probabilistic interpretation suggests that this preconditioned Richardson algorithm
will converge rapidly when the conditional expectation, with respect to the density
(3.121), of $f - f^n$ is nearly equal to $f - f^n$. This will occur when $Y \approx X$. For these random
variables to be nearly equal, the original kernel $k(x,y)$ must be peaked about the line
$y = x$. The more this kernel is peaked, the more rapidly convergent this preconditioned
Richardson algorithm will be.

In fact, if $Y \approx h(X)$, for some monotone function $h$, then by defining $Z = h(X)$, we
have $Z \approx Y$, and so we have reduced the problem to the case considered in the previous
paragraph.
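
The dependence of the convergence rate on how sharply the kernel is peaked about $y = x$ can be illustrated with a discrete analogue of (3.122). In the sketch below, the Gaussian kernels, the two widths, the grid, the test solution, and the helper name `conditional_expectation_errors` are assumptions chosen for the illustration.

```python
import numpy as np

def conditional_expectation_errors(sigma, n_iter=5, m=50):
    """Error norms ||f - f^n|| for a row-normalized Gaussian kernel of width sigma."""
    x = (np.arange(m) + 0.5) / m
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))
    K_tilde = K / K.sum(axis=1, keepdims=True)     # discrete analogue of (3.113)
    f_true = np.sin(np.pi * x)                     # hypothetical solution
    g_tilde = K_tilde @ f_true
    f = np.zeros(m)
    errors = []
    for _ in range(n_iter):
        f = f + (g_tilde - K_tilde @ f)            # discrete analogue of (3.122)
        errors.append(float(np.linalg.norm(f_true - f)))
    return errors

peaked = conditional_expectation_errors(sigma=0.02)    # kernel concentrated near y = x
diffuse = conditional_expectation_errors(sigma=0.30)   # kernel spread over the unit square
```

For these two kernels the error after five steps is smaller for the peaked kernel than for the diffuse one, as the probabilistic argument suggests.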
A Motivation Based on Convolution Kernels

Consider the Fredholm integral equation of the first kind

$$\int_{-\infty}^{\infty} k(x - y)f(y)\,dy = g(x), \tag{3.124}$$

where the convolution kernel $k(x - y)$ is positive, bounded, and

$$\int_{-\infty}^{\infty} k(x - y)\,y^s\,dy < \infty, \tag{3.125}$$

for all $s > 0$.

Let $\pi_t(x)$ represent a polynomial of degree at most $t$. For any integer $t$, we have that

$$\begin{aligned}
\int_{-\infty}^{\infty} k(x - y)\,y^t\,dy &= \int_{-\infty}^{\infty} k(y)(x - y)^t\,dy \\
&= x^t\int_{-\infty}^{\infty} k(y)\,dy + \pi_{t-1}(x).
\end{aligned} \tag{3.126}$$

If we transform the equation (3.124) so that the kernel of the transformed equation is

$$\tilde{k}(x - y) = \frac{k(x - y)}{\int_{-\infty}^{\infty} k(y)\,dy}, \tag{3.127}$$

we have that

$$\int_{-\infty}^{\infty} \tilde{k}(x - y)\,y^t\,dy = x^t + \pi_{t-1}(x). \tag{3.128}$$

It is easy to see that, if $u^n = f - f^n$ is a polynomial of degree $t$, then the preconditioned
Richardson algorithm (3.106), applied to the convolution equation (3.124) with stochastic
preconditioning, and with $\theta = 1$, will exactly converge in at most $t$ iterations, reducing
the degree of $u^n$ by at least one with each successive iteration.

To the extent that there is a function $u^n$ for a specific problem which is well approximated
by a low order polynomial, and to the extent that the normalized kernel for a
specific problem is well approximated by a convolution, one would expect Richardson's
algorithm with stochastic preconditioning to converge rapidly.

3.3  Richardson’s  Algorithm  and  Iterative  Regularization 

Integration  tends  to  smooth.  It  is  clear,  therefore,  that  the  Richardson  iterations  (3.3) 
and  (3.2)  will  tend  to  produce  smooth  iterates  when  applied  to  integral  equations  of  the 
first  kind  having  smooth  kernels.  There  is  regularization  implicit  in  using  Richardson’s 
algorithm,  and  it  is  the  purpose  of  this  section  to  examine  the  nature  of  this  regularization 
for  the  special  case  of  the  Richardson  algorithm  (3.3),  applied  to  matrix  equations  with 
matrices  having  a  diagonalizable  nullspace.  We  have  in  mind  matrix  equations  which  arise 
from  the  discretization  of  ill-posed  integral  equations  of  the  first  kind,  so  that  the  more 
oscillatory  eigenvectors  correspond  to  the  many  small  eigenvalues  of  the  matrix. 
3.3.1  Regularization Methods

One approach to 'solving' a matrix equation $Kf = g$ which is the discretization of an
ill-posed linear operator equation is the method of regularization of Tikhonov (1962) and
Phillips (1963) (see also Tikhonov and Arsenin, 1977, and Groetsch 1984). The basic idea
is very simple. We do not want to solve any discretized version of an ill-posed equation
exactly. Instead, we minimize the quadratic form

$$U(z) = (Kz - g)^*(Kz - g) + \gamma z^*Lz, \tag{3.129}$$

where $L$ is positive definite, and is chosen so that $z^*Lz$ will tend to be large when $z$ is
not smooth. A positive constant, $\gamma$, determines the relative importance of the first (least-squares)
and second (penalty) terms of the functional $U(z)$. When $\gamma$ is zero, minimizing
$U(z)$ is equivalent to minimizing $\|Kz - g\|$. As $\gamma$ is increased, increasing weight is put on
the smoothness of the solution, and less on 'fidelity' to the equation.

The quadratic form (3.129) is usually associated with the method of regularization.
We will instead be concerned with the functional

$$\tilde{U}(z) = (z - f)^*(z - f) + \gamma z^*Lz. \tag{3.130}$$

Although we will not require $L$ to be positive definite, our motivation for choosing $L$ is
the same as in (3.129). Minimizing $\tilde{U}$, like minimizing $U$, involves a compromise between
fidelity to the equation and smoothness of the solution. The difference is that the first term
in $U$ measures how close the right hand side corresponding to an approximate solution is
to $g$, while the first term of $\tilde{U}$ compares the approximate solution to a solution vector $f$.

3.3.2  Regularization  Implicit  in  Richardson’s  Algorithm 

Consider the Richardson iteration (3.3) applied to equations $Kf = g$ for which $K$ is an
$m \times m$ matrix with a diagonalizable nullspace (Section 2.1.4). Assume that the iteration
converges for a particular choice of $\theta$. We have shown (Corollary 3.1.1) that since (3.3)
converges, it must converge for any $f^0$, so we can choose $f^0$ arbitrarily, and denote the
corresponding solution by $f$. We will show that, under these conditions,

$$\delta^n = f^{n+1} - f^n = \theta(g - Kf^n) \tag{3.131}$$

is a stationary point of the quadratic form

$$Q(z) = Q_{LS}(z) + Q_P(z) \equiv (u^n - z)^*(u^n - z) + z^*(K^{\#}/\theta - I)z, \tag{3.132}$$

where $K^{\#}$ is the group inverse of $K$, which exists and is unique since $K$ has a diagonalizable
nullspace, and

$$u^n = f - f^n. \tag{3.133}$$

Differentiating $Q(z)$ with respect to $z$ and setting this derivative equal to zero, we
note that a stationary point $z$ must satisfy the linear relationship

$$K^{\#}z - \theta u^n = 0. \tag{3.134}$$

The matrix $K$ has index $\iota = 1$, and hence there exists a nonsingular matrix $B$ such that

$$K = B^{-1}\begin{bmatrix} J & 0 \\ 0 & 0 \end{bmatrix}B, \tag{3.135}$$
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where  J  is  a  Jordan  form  matrix  of  blocks  corresponding  to  nonzero  eigenvalues  of  K . 
The  group  inverse  K *  is  then 


A'#  =  B~x 


J-1  0 
0  0 


B , 


(3.136) 


and  we  have,  in  the  notation  of  Section  3.1.2  and  Lemma  3.1.1,  that 

K*K  =  KK*  =  B'lJBi.B-lJ-lBt.  =  BlBv  =  V.  (3.137) 


Substitute  6n  for  z  in  (3.134)  and  use  Lemma  3.1.1  (d)  to  get 

d  =  K*z  -  8un  =  -8(1  -  K#It)un  =  -8(1  -  V)un  €  Af(A').  (3.138) 


We  will  show  next  that  un  €  Tl(K),  so  that,  using  once  again  Lemma  3.1.1,  un  =  Vhn 
for  some  vector  An,  and  hence  d  =  0. 

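The vector $f^n - f^0$ can be expressed as a sum of steps $\delta^i$,

$$f^n - f^0 = \sum_{i=0}^{n-1} \delta^i \quad \text{for } n > 0, \tag{3.139}$$

where $\delta^i = \theta K u^i \in \mathcal{R}(K)$ for every $i$. Therefore $f^n - f^0 \in \mathcal{R}(K)$ for every $n$. Since

$$\lim_{n \to \infty} (f^n - f^0) = f - f^0, \tag{3.140}$$

$f - f^0 \in \mathcal{R}(K)$. Hence

$$u^n = f - f^n = (f - f^0) + (f^0 - f^n) \in \mathcal{R}(K), \tag{3.141}$$

which completes the proof that $\delta^n$ is a stationary point of (3.132).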

Lemma 3.1.1 (c) implies that we can express $f$ and $f^0$ as

$$f = Vf + (I - V)f = f_R + f_N \tag{3.142}$$

and

$$f^0 = Vf^0 + (I - V)f^0 = f^0_R + f^0_N, \tag{3.143}$$

where $V$ and $(I - V)$ project onto $\mathcal{R}(K)$ and $\mathcal{N}(K)$, respectively. Since $f^n - f^0 \in \mathcal{R}(K)$
we have, for all $n > 0$,

$$f^0_N = (I - V)f^n = f_N. \tag{3.144}$$

We can obtain a simple expression for (3.132) evaluated at $\delta^n$ in terms of $u^n$ and $u^{n+1}$.
Since

$$u^n - \delta^n = (f - f^n) - (f^{n+1} - f^n) = u^{n+1}, \tag{3.145}$$

we see that

$$Q_{ls}(\delta^n) = \|u^{n+1}\|^2. \tag{3.146}$$

Using basic properties of the group inverse (Section 2.1.4), straightforward (though
somewhat tedious) algebra leads to

$$Q(\delta^n) = (u^{n+1})'u^n. \tag{3.147}$$
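
The stationarity of the Richardson step and the identity (3.147) can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the text: $K$ is taken to be a random symmetric positive semidefinite matrix of deficient rank (for a symmetric matrix the group inverse coincides with the Moore-Penrose inverse, which is what `numpy.linalg.pinv` returns), and the names `K_group`, `f_limit`, `max_grad` and `max_gap` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 6, 4

# Symmetric positive semidefinite K of rank r, so its nullspace is diagonalizable.
A = rng.standard_normal((m, r))
K = A @ A.T
# For symmetric K the group inverse equals the Moore-Penrose inverse.
K_group = np.linalg.pinv(K, rcond=1e-10)

f_true = rng.standard_normal(m)
g = K @ f_true                               # consistent right hand side
theta = 1.0 / np.linalg.eigvalsh(K).max()    # satisfies 0 < theta < 2 / lambda_max

f0 = rng.standard_normal(m)
V = K_group @ K                              # projector onto the range of K
f_limit = (np.eye(m) - V) @ f0 + K_group @ g # limit of the iteration started at f0


def Q(z, u):
    """Quadratic form (3.132) for a given u = f - f^n."""
    return (u - z) @ (u - z) + z @ (K_group / theta - np.eye(m)) @ z


f = f0.copy()
max_grad, max_gap = 0.0, 0.0
for n in range(10):
    u = f_limit - f                          # u^n, as in (3.133)
    delta = theta * (g - K @ f)              # Richardson step (3.131)
    u_next = u - delta                       # u^{n+1}, as in (3.145)
    # Gradient of Q at delta (uses the symmetry of K_group); should vanish.
    grad = 2.0 * (K_group @ delta / theta - u)
    max_grad = max(max_grad, float(np.abs(grad).max()))
    # Difference between Q(delta^n) and (u^{n+1})' u^n; should vanish by (3.147).
    max_gap = max(max_gap, abs(Q(delta, u) - u_next @ u))
    f = f + delta
```

Both `max_grad` and `max_gap` should be at the level of rounding error if the derivation above holds for this test matrix.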

Note the similarity between (3.130) and (3.132). We have shown that each step (3.131)
corresponds to solving a penalized least squares problem, where the penalty term $Q_p$ is
determined by the matrix $K$, and the ‘least squares’ term $Q_{ls}$ is $\|u^n - \delta^n\|^2$, where
$u^n = f - f^n$. Further discussion of the relationship between linear smoothers and penalized least
squares can be found in Buja et al. (1989). The notion that there can be regularization
implicit in iterative algorithms is apparently due to Bakushinskii (1967).

3.3.3 Positive Definite K


Although  (3.2)  does  not  make  explicit  use  of  regularization,  at  each  iteration  regularization 
is  implicit  in  this  algorithm  and  the  character  of  this  regularization  is  determined  by  the 
matrix  K.  To  see  how  the  second  term  in  (3.132)  can  penalize  ‘rough’  iterates,  we  consider 
the simple special case of $K$ an $m \times m$ positive definite matrix, though with many small
eigenvalues. Let the (positive) eigenvalues of $K$ be $\{\lambda_i\}_{i=1}^m$ and let the corresponding
orthonormal eigenvectors be $\{v_i\}_{i=1}^m$. By the spectral theorem (Theorem 2.1.1),

$$K = \sum_{i=1}^{m} \lambda_i v_i v_i', \tag{3.148}$$

where

$$\lambda_1 > \lambda_2 > \dots > \lambda_m > 0.$$

Assume that $f = K^{-1}g$ and $0 < \theta < 2/\lambda_1$, so that $f^n \to f$ for all $f^0$. Let the expansion
of $\delta^n$ in terms of the eigenvectors of $K$ be

$$\delta^n = \sum_{i=1}^{m} c_i^n v_i. \tag{3.149}$$

In terms of the spectral decomposition (3.148) of $K$, the penalty term at the minimum
becomes

$$Q_p(\delta^n) = (\delta^n)'(K^{-1}/\theta - I)\delta^n = \sum_{i=1}^{m} (c_i^n)^2 \left[ (\theta\lambda_i)^{-1} - 1 \right]. \tag{3.150}$$

Since the matrix $K$ is a discretization of a smooth function, the more oscillatory
eigenvectors will correspond to small eigenvalues. Components of $\delta^n$ in the directions of these
highly oscillatory eigenvectors will have a large contribution to the penalty term, hence
the minimum of $Q$ will tend to occur at a vector $\delta^n$ which has small components in the
direction of the ‘rougher’ eigenvectors - that is, $\delta^n$ will tend to be smooth if $K$ is smooth.
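
To make the size of the penalty weights concrete, the following sketch computes $(\theta\lambda_i)^{-1} - 1$ for a small symmetric test matrix. It is an illustration under assumptions of my own: the matrix is a midpoint-rule discretization of the Green's function kernel $y(1-x)$, $x(1-y)$ on a uniform grid (not the quadrature of Appendix A), $\theta$ is set to $1/\lambda_1$, and the count of sign changes is used as a crude, hypothetical measure of how oscillatory an eigenvector is.

```python
import numpy as np

m = 20
x = (np.arange(m) + 0.5) / m                         # midpoint grid on [0, 1]
X, Y = np.meshgrid(x, x, indexing="ij")
K = np.where(Y < X, Y * (1 - X), X * (1 - Y)) / m    # symmetric, positive definite

lam, vecs = np.linalg.eigh(K)                        # ascending eigenvalues
lam, vecs = lam[::-1], vecs[:, ::-1]                 # reorder: lambda_1 > ... > lambda_m
theta = 1.0 / lam[0]
weights = 1.0 / (theta * lam) - 1.0                  # penalty weight on each eigen-direction

# Number of sign changes in each eigenvector, ordered like the eigenvalues.
sign_changes = [int(np.sum(np.diff(np.sign(v)) != 0)) for v in vecs.T]
```

With this choice of $\theta$ the smoothest direction carries essentially zero penalty, and `weights` grows as the eigenvalue shrinks, which is the mechanism described in the paragraph above.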


3.4  Examples 

We  illustrate  the  ideas  of  this  chapter  by  considering  linear  Fredholm  and  Volterra  exam¬ 
ples. 


3.4.1  A  Fredholm  Example 

In Chapter 1, we introduced a Fredholm integral equation of the first kind,

$$\int_0^1 k_1(x,y) f(y)\,dy = g(x), \tag{3.151}$$

with kernel

$$k_1(x,y) = \begin{cases} y(1 - x) & y < x \\ x(1 - y) & y > x. \end{cases} \tag{3.152}$$

We continue the discussion of this example which began in that chapter.

An Eigenfunction Analysis

For the very simple example (3.151), we can determine the eigenfunctions and eigenvalues
of both the kernel (3.152) and the preconditioned kernel

$$\tilde{k}_1(x,y) = \frac{2k_1(x,y)}{x(1 - x)} = \begin{cases} 2y/x & y < x \\ 2(1 - y)/(1 - x) & y > x. \end{cases} \tag{3.153}$$


Here  we  are  transforming  the  kernel,  and  then  comparing  the  eigenfunctions  of  the  trans¬ 
formed  kernel  (3.153)  with  the  kernel  (3.152).  In  numerical  examples  we  will,  as  discussed 
in  Section  3.2.2,  discretize  (3.152)  to  get  a  matrix  equation,  and  then  premultiply  both 
sides  of  this  equation  by  the  appropriate  matrix  D,  so  that  the  matrix  becomes  stochastic. 

Since (3.152) is the Green’s function for the differential equation

$$\frac{d^2 g}{dx^2} + f = 0, \tag{3.154}$$

subject to the boundary conditions

$$g(0) = g(1) = 0, \tag{3.155}$$

the eigenfunctions of (3.151) are the same as those of the differential equation (3.154),
subject to the boundary conditions (3.155). That is, the $t$th eigenfunction of (3.151) is

$$\phi_t(x) = \sin(t\pi x), \tag{3.156}$$

and the corresponding eigenvalue is

$$\lambda_t = (t\pi)^{-2}. \tag{3.157}$$

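The key to the eigenfunction analysis of the preconditioned Fredholm operator with
kernel (3.153) is to note that

$$\int_0^1 \tilde{k}_1(x,y)\, y^t\,dy = \frac{2}{(t+1)(t+2)} \sum_{j=0}^{t} x^j. \tag{3.158}$$

It follows that $t$th degree polynomials are transformed into $t$th degree polynomials by the
integral equation with kernel (3.153). It turns out that the eigenfunctions are polynomials,

$$\tilde{\phi}_t(x) = \sum_{i=0}^{t} a_{t,i} x^i. \tag{3.159}$$

The eigenvalues are

$$\tilde{\lambda}_t = \frac{2}{(t+1)(t+2)}, \tag{3.160}$$

and the coefficients in (3.159) can be determined recursively from the formulas

$$a_{t,t} = 1, \qquad a_{t,j} = \frac{\sum_{i=j+1}^{t} (\tilde{\lambda}_i/\tilde{\lambda}_t)\, a_{t,i}}{1 - \tilde{\lambda}_j/\tilde{\lambda}_t}, \quad j = t-1, \dots, 0. \tag{3.161}$$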


After scaling (3.152) by multiplying by $\pi^2$, so that both (3.152) and (3.153) have
largest eigenvalue one, we note that the eigenvalues corresponding to the preconditioned
equation are substantially larger than the eigenvalues corresponding to the kernel (3.152),
particularly for moderate $t$. In the numerical examples to follow, we will iteratively solve
a matrix equation having a stochastic matrix; consequently the largest eigenvalue of this
matrix will equal one.

Another  thing  to  note  from  this  example  is  that  the  ‘character’  of  the  eigenfunctions 
is  completely  changed  -  from  trigonometric  functions  (all  of  which  equal  zero  at  the 
endpoints)  to  polynomials  -  by  the  stochastic  preconditioning. 
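
The polynomial eigenstructure of the preconditioned operator can be verified by direct quadrature. The sketch below assumes that the preconditioned kernel is the kernel (3.152) divided by its integral over $y$, which is $x(1-x)/2$, so that it equals $2y/x$ for $y < x$ and $2(1-y)/(1-x)$ for $y > x$; under that assumption a monomial $y^t$ is mapped to $2/((t+1)(t+2))$ times $1 + x + \dots + x^t$. The function names are hypothetical, and the recursion in `eigen_coeffs` is one way of solving the resulting triangular eigenproblem, not necessarily the form used in the original text.

```python
import numpy as np

nodes, wts = np.polynomial.legendre.leggauss(40)


def integrate(func, a, b):
    """Gauss-Legendre quadrature of func over [a, b]."""
    y = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    return 0.5 * (b - a) * np.sum(wts * func(y))


def apply_preconditioned(t, x):
    """Integral over [0, 1] of k1_tilde(x, y) * y**t, split at the kink y = x."""
    left = integrate(lambda y: (2.0 * y / x) * y**t, 0.0, x)
    right = integrate(lambda y: (2.0 * (1.0 - y) / (1.0 - x)) * y**t, x, 1.0)
    return left + right


def eigenvalue(t):
    """Eigenvalue attached to the degree-t eigenpolynomial."""
    return 2.0 / ((t + 1) * (t + 2))


def eigen_coeffs(t):
    """Coefficients a[0..t] of the degree-t eigenpolynomial, normalized so a[t] = 1."""
    a = np.zeros(t + 1)
    a[t] = 1.0
    for j in range(t - 1, -1, -1):
        s = sum(eigenvalue(i) * a[i] for i in range(j + 1, t + 1))
        a[j] = s / (eigenvalue(t) - eigenvalue(j))
    return a
```

For example, `eigen_coeffs(3)` gives the cubic $x^3 - \tfrac{3}{2}x^2 + \tfrac{9}{14}x - \tfrac{1}{14}$, which takes the values $\mp 1/14$ at the endpoints rather than vanishing there.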

A  Numerical  Investigation 

In  this  subsection,  we  describe  some  numerical  results  on  the  equation  (3.151).  The 
computations  were  performed  using  the  S  programming  language  (Becker,  Chambers  and 
Wilks,  1988),  and  a  software  listing  is  in  Appendix  B. 

Let  the  right  hand  side  of  (3.151)  be 

$$g_1(x) = x^3(1 - x)^2. \tag{3.162}$$

We discretize the integral equation (3.151) with kernel (3.152) and right hand side
(3.162) using 50 point Gauss-Legendre quadrature as discussed in Appendix A. Let the
matrix of this discretized equation be denoted $K_1$, and let the corresponding preconditioned
matrix be $\tilde{K}_1$. The matrix $\tilde{K}_1$ is formed by discretizing the kernel (3.152) as in
Appendix A, and then normalizing the rows of this matrix to each sum to one (see Section
3.2.2). The largest eigenvalue of $K_1$ is .1013913, which is approximately equal to $\pi^{-2}$, the
largest eigenvalue of the corresponding integral equation. For the Richardson iteration
without preconditioning (3.3), we take $\theta$ to equal the reciprocal of the largest eigenvalue,
i.e. $\theta \approx 9.863$, so that the largest eigenvalue of $\theta K_1$ is (very nearly) equal to one. For the
Conditional Expectation algorithm (Richardson’s algorithm (3.2) with stochastic
preconditioning) the largest eigenvalue is equal to one, so we let $\theta = 1$. We choose the initial
iterate $f^0 = 0$ for now; we will consider the important role of $f^0$ for the algorithm without
preconditioning below. Fifty iterations of both methods are compared in Figure 3.1. The
preconditioned method gives an approximation very near the solution

$$f(x) = -20x^3 + 24x^2 - 6x \tag{3.163}$$

before  the  convergence  rate  begins  to  decrease  dramatically.  The  method  without  pre¬ 
conditioning  is  still  far  from  the  solution  at  the  50th  iteration,  and,  since  by  the  50th 
iteration  the  steps  taken  at  each  iteration  are  very  small,  it  will  take  many  iterations  to 
get  appreciably  closer  to  the  solution. 

Another way of seeing the dramatic effect preconditioning has had on the convergence
rate is to examine the distance, in $\ell_2$ norm, to the discretized solution as a function of the
iteration index. This comparison is made in Figure 3.2a. In Figure 3.2b, we have plotted
the residual norms $\|Kf^n - g\|$ (for the discretized functions, using the $\ell_2$ norm).
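
The comparison described above can be reproduced in outline with a few lines of NumPy. This is a sketch rather than the software of Appendix B: it assumes a plain Nystrom discretization $K_{ij} = w_j k_1(x_i, x_j)$ on 50 Gauss-Legendre nodes mapped to $[0,1]$, so individual numbers need not match the ones reported in the text, and the names `iterate`, `err_rich` and `err_ce` are hypothetical.

```python
import numpy as np

n_pts, n_iter = 50, 50
t, w = np.polynomial.legendre.leggauss(n_pts)
x = 0.5 * (t + 1.0)            # Gauss-Legendre nodes mapped to [0, 1]
w = 0.5 * w                    # matching quadrature weights

X, Y = np.meshgrid(x, x, indexing="ij")
kernel = np.where(Y < X, Y * (1.0 - X), X * (1.0 - Y))
K = kernel * w                 # K[i, j] = k1(x_i, x_j) * w_j

g = x**3 * (1.0 - x) ** 2                          # right hand side (3.162)
f_true = -20.0 * x**3 + 24.0 * x**2 - 6.0 * x      # solution (3.163)

# Largest eigenvalue of K, computed from the symmetric matrix similar to K.
sqrt_w = np.sqrt(w)
lam_max = np.linalg.eigvalsh(sqrt_w[:, None] * kernel * sqrt_w[None, :]).max()


def iterate(step_scale, f0):
    """Run f <- f + step_scale * (g - K f); record l2 error and residual norms."""
    f = f0.copy()
    errors, residuals = [], []
    for _ in range(n_iter):
        f = f + step_scale * (g - K @ f)
        errors.append(np.linalg.norm(f - f_true))
        residuals.append(np.linalg.norm(K @ f - g))
    return f, np.array(errors), np.array(residuals)


f0 = np.zeros(n_pts)
# Richardson without preconditioning: scalar theta = 1 / lambda_max.
f_rich, err_rich, res_rich = iterate(1.0 / lam_max, f0)
# Conditional Expectation: rows of K normalized to sum to one, theta = 1.
f_ce, err_ce, res_ce = iterate(1.0 / K.sum(axis=1), f0)
```

Plotting `err_rich` and `err_ce` against the iteration index on a logarithmic scale gives the kind of comparison shown in Figure 3.2a; replacing `f0` by `-2.0 * x` corresponds to the starting function (3.164).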

The eigenvectors of (3.152) are $\sin(t\pi x)$, which equal zero, for all $t$, at $x = 0$ and $x = 1$.
However, $f(1) = -2$, so contributions from eigenvectors of $K_1$ corresponding to very small
eigenvalues are required in order for Richardson’s algorithm without preconditioning to
closely approximate $f(x)$ near $x = 1$. The eigenfunctions (3.159, 3.161) corresponding to
the Conditional Expectation algorithm are polynomials, and they do not all go to zero at
the endpoints of $[0, 1]$.

One might argue that the comparison in Figures 3.1 and 3.2 is unfair, since the
eigenfunctions of (3.152) are ill-suited for approximating (3.163), at least when $f^0 = 0$. One
way to compare the two algorithms on a more even footing is to use the starting function

$$f^0(x) = -2x, \tag{3.164}$$

so that $f^0(0) = f(0)$ and $f^0(1) = f(1)$. (Of course, in practice one usually does not know
the value of the unknown function at the endpoints.) Fifty iterations of both algorithms,
beginning with the starting iterate (3.164), are displayed in Figure 3.3. The distance from
the  solution  and  residual  norm,  as  functions  of  the  iteration  index,  are  given  in  Figure 
3.4.  The  methods  both  perform  reasonably  well,  with  Richardson  initially  doing  better, 
but  with  the  Conditional  Expectation  algorithm  ‘catching  up’  after  30  or  40  iterations. 

These  two  numerical  examples  each  illustrate  the  notion  of  ‘near-convergence’  and 
‘near-solution’.  The  Conditional  Expectation  algorithm  is  able  to  provide  smooth  ap¬ 
proximate  solutions  which  are  close  to  the  solutions  of  the  continuous  problem  (Figures 
3.2a and 3.4a), and for which the corresponding residuals $\|Kf^n - g\|$ are small (Figures
3.2b  and  3.4b).  The  Richardson  algorithm  also  provided  smooth  iterates,  although  in  the 
first  example  the  Richardson  approximations  are  very  slowly  convergent  near  x  =  1. 

Both  the  Richardson  and  the  Conditional  Expectation  algorithms  produce  smooth 
approximate  solutions  even  with  the  inevitable  error  in  the  right  hand  side.  This  is  an 
instance  of  the  idea  of  iterative  regularization  discussed  in  Section  3.3.1.  However,  even¬ 
tually  the  approximations  may  become  less  smooth,  as  the  components  of  the  right  hand 
side  in  the  directions  of  eigenvectors  corresponding  to  smaller  eigenvalues  begin  to  con¬ 
tribute.  Since  the  right  hand  side  for  this  example  is  smooth,  and  since  preconditioning  has 
reduced  the  condition  number  substantially  (from  156261  to  810.34),  it  would  take  many 
iterations  to  observe  the  approximations  depart  from  the  true  solution,  and  even  then  the 
deviation  would  be  slight.  In  order  to  see  an  effect  in  a  reasonable  number  of  iterations, 
we add a component, with coefficient .01, in the direction of the 25th singular vector of
the matrix $K_1$ to the right hand side (3.162). This leads to a perturbed right hand side,
the  Fourier  coefficients  of  which  are  presented  in  Figure  3.5a,  and  a  plot  of  which  is  given 
in  Figure  3.5b.  In  Figure  3.6a,  we  display  50  iterations  of  the  Conditional  Expectation 
algorithm  with  this  perturbed  right  hand  side,  and  in  Figure  3.6b,  we  give  the  solution  of 
the  matrix  equation  obtained  by  matrix  inversion.  Some  obvious  points  to  be  made  here 
include  the  oscillatory  nature  of  the  25th  singular  vector,  as  reflected  in  the  ‘noisy’  right 
hand  side  in  Figure  3.5b,  and  the  unpleasant  solution  in  Figure  3.6b.  In  Figure  3.6a,  we 
see  a  dramatic  illustration  of  near-convergence,  as  the  Conditional  Expectation  approxi¬ 
mations  stay  reasonably  close  to  the  discretized  solution  to  the  (unperturbed)  continuous 
problem. With some smoothing of the steps $f^{n+1} - f^n$ (as discussed in Chapter 7), much
of the roughness of the approximations in Figure 3.6a can be eliminated. The distances
of the approximate solutions from the solutions corresponding to both the perturbed and
unperturbed right hand sides are given in Figure 3.7a, and the corresponding residual norms
are in Figure 3.7b. Notice that the approximations are closest in norm to the solution for the
unperturbed right hand side at the 8th iteration, and that the corresponding residual norm at
the 8th iteration is fairly small. From that point on,
the iterations move further away from the solution which corresponds to the unperturbed
right hand side as they approach the exact solution, which corresponds to the perturbed
right hand side. However, the residual norm with respect to the unperturbed right hand
side continues to slowly decrease until about the 30th iteration.

3.4.2 A Volterra Example

As an example of a Volterra equation,

$$\int_0^x k_2(x,y) f(y)\,dy = g_2(x), \tag{3.165}$$

we take the differentiation problem, with kernel

$$k_2(x,y) = \begin{cases} (x - y)^\alpha & \text{for } y < x \\ 0 & \text{for } y > x, \end{cases} \tag{3.166}$$

for $\alpha > -1$, and with right hand side given by the power series

$$g_2(x) = \sum_{s=0}^{\infty} a_s x^s. \tag{3.167}$$

If $\alpha$ is a nonnegative integer, then the solution to this equation is

$$f(x) = \frac{1}{\alpha!} \frac{d^{\alpha+1} g_2(x)}{dx^{\alpha+1}}, \tag{3.168}$$

where

$$g_2(0) = g_2'(0) = \dots = g_2^{(\alpha)}(0) = 0. \tag{3.169}$$

This example is useful because it is easy to examine the Conditional Expectation algorithm
analytically.

To precondition the kernel, we divide (3.166) by

$$\int_0^x k_2(x,y)\,dy = \int_0^x (x - y)^\alpha\,dy = \frac{x^{\alpha+1}}{\alpha + 1}, \tag{3.170}$$

and we denote the quotient $\tilde{k}_2(x,y)$. For any $t \ge 0$,

$$\int_0^x \tilde{k}_2(x,y)\, y^t\,dy = \frac{\Gamma(\alpha + 2)\Gamma(t + 1)}{\Gamma(\alpha + t + 2)}\, x^t, \tag{3.171}$$

hence the eigenfunctions of the preconditioned kernel are the powers

$$\tilde{\phi}_t(x) = x^t \tag{3.172}$$

for $t = 0, 1, \dots$, and the corresponding eigenvalues are

$$\tilde{\lambda}_t = \frac{\Gamma(\alpha + 2)\Gamma(t + 1)}{\Gamma(\alpha + t + 2)}. \tag{3.173}$$

For example, let $\alpha = 0$. A little algebra shows that, if $g(x) = x^{s+1}/(s + 1)$, then the
corresponding $f^n$ are given by

$$f^n(x) = \left[ 1 - (1 - 1/(s + 1))^n \right] x^s. \tag{3.174}$$

Without preconditioning, it is easy to show that the Richardson iteration does not
converge for this example, regardless of $\theta$. From the linearity of the Volterra integral operator
and (3.174) we see that, for the right hand side (3.167),

$$f^n(x) = \sum_{s=1}^{\infty} s\, a_s \left[ 1 - (1 - 1/s)^n \right] x^{s-1}. \tag{3.175}$$

If $g$ is a smooth function plus noise, then $f^n$ will reflect the smooth components initially,
since these will correspond to fairly small values of $s$. Eventually, the solution will become
rougher, but only when $(1 - 1/s)^n$ becomes small for fairly large $s$.
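
The eigenvalue formula for the preconditioned Volterra kernel and the closed form of the iterates for $\alpha = 0$ can both be checked numerically. The sketch below is an illustration only: it assumes the preconditioned kernel is $(\alpha+1)(x-y)^\alpha / x^{\alpha+1}$ on $y < x$, that the iteration starts from $f^0 = 0$ with $\theta = 1$, and the function names are hypothetical.

```python
import math
import numpy as np

nodes, wts = np.polynomial.legendre.leggauss(200)


def apply_preconditioned(alpha, t, x):
    """Integral over [0, x] of k2_tilde(x, y) * y**t by Gauss-Legendre quadrature,
    with k2_tilde(x, y) = (alpha + 1) * (x - y)**alpha / x**(alpha + 1)."""
    y = 0.5 * x * (nodes + 1.0)                  # nodes mapped to (0, x)
    integrand = (alpha + 1.0) * (x - y) ** alpha / x ** (alpha + 1.0) * y**t
    return 0.5 * x * np.sum(wts * integrand)


def eigenvalue(alpha, t):
    """Gamma(alpha + 2) * Gamma(t + 1) / Gamma(alpha + t + 2)."""
    return math.gamma(alpha + 2.0) * math.gamma(t + 1.0) / math.gamma(alpha + t + 2.0)


def ce_iterate_monomial(s, n):
    """Coefficient of x**s in f^n when alpha = 0 and g(x) = x**(s+1) / (s+1).

    The preconditioned right hand side is g(x) / x = x**s / (s+1), and the
    preconditioned operator multiplies x**s by eigenvalue(0, s)."""
    lam = eigenvalue(0.0, s)
    coeff = 0.0                                  # f^0 = 0
    for _ in range(n):
        coeff = coeff + (1.0 / (s + 1) - lam * coeff)
    return coeff
```

For instance, `ce_iterate_monomial(3, 10)` should agree with $1 - (3/4)^{10}$, the factor predicted by the closed form for $s = 3$ and $n = 10$.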

Numerical experimentation suggests that, for reasonably smooth right hand sides, the
iterative algorithm outlined in this section can be useful for numerical differentiation. We
will discuss the numerical solution of an equation related to the Volterra equation with
kernel $k_2(x,y)$ with $\alpha = 1/2$ in Chapter 7.

Figure 3.1a: Fifty Iterations of Richardson’s Algorithm Without Preconditioning

Figure 3.2a: Convergence to Solution for Richardson and Conditional Expectation Algorithms

Figure 3.3a: Fifty Iterations of Richardson’s Algorithm Without Preconditioning ($f^0(x) = -2x$)

Figure 3.6a: Fifty Iterations of Conditional Expectation Algorithm (Perturbed RHS)

Figure 3.7a: Distances to Solutions for Perturbed and Unperturbed RHS

[Figures not reproduced. Recoverable axis labels: Iteration; $\ell_2$ norm $\|g - g^n\|$; $\ell_2$ norm $\|f - f^n\|$.]

Chapter  4 

The  Conditional  Expectation 
Algorithm  for  Nonlinear  Integral 
Equations  with  Peaked  Kernels 


4.1  A  Nonlinear  Equation 

All  of  the  integral  equations  which  we  will  consider  can  be  expressed  as  integral  equations 
of  the  first  kind  of  the  form 


$$\int_0^1 k\{x, y, f[\phi(x,y)]\}\,dy = g(x), \tag{4.1}$$

where the kernel $k$ and the function $\phi : [0,1] \times [0,1] \to [0,1]$ are known, and $k$ is
nonnegative and bounded. To fix ideas, we will restrict attention, for the most part, to kernels
defined on the unit square. However, this restriction is not essential.

We will often need to refer to the kernel of (4.1) and to the derivative with respect to
its third argument, so we introduce the following notation to save writing:

$$k\{x, y, f[\phi(x,y)]\} \equiv k(x,y,f) \tag{4.2}$$

and

$$\left. \frac{\partial k(x,y,z)}{\partial z} \right|_{z = f[\phi(x,y)]} \equiv k'(x,y,f). \tag{4.3}$$


In  addition  to  requiring  that  k  be  nonnegative  and  bounded,  we  also  require  k'  to 
be  nonnegative  and  bounded.  The  reason  for  this  is  that  we  will  introduce  an  iterative 
algorithm,  based  on  applying  the  Conditional  Expectation  algorithm  of  Section  3.2.3  to 
linearizations  of  (4.1),  which  can  be  motivated  by  considering  k'  to  be  proportional  to  a 
bivariate  probability  density. 


4.1.1  Peaked  Kernels 

We  will  restrict  attention  to  integral  equations  with  kernels  having  certain  features,  which 
we  summarize  here.  A  kernel  k  will  be  said  to  be  peaked  if  k  is  nonnegative  and  bounded, 
and  k'  has  the  following  properties 

1. $k'(x,y,f) \ge 0$ for all $x$ and $y$,

2. $\int_0^1 k'(x,y,f)\,dy < \infty$ for all $x$, and

3. There exists a monotone function $t : [0,1] \to [0,1]$, such that, for all $x$,
$k'(x, t(x), f) = \max_{y \in [0,1]} k'(x,y,f)$.

A  nonlinear  kernel  can  be  peaked  for  some  values  of  /  and  not  for  others.  When  we  refer 
to  a  kernel  as  being  peaked,  we  intend,  somewhat  imprecisely,  for  this  to  mean  that  this 
kernel  is  peaked  for  functions  /  of  interest. 

A simple (and important) example of a peaked kernel is the kernel,

$$k(x,y,f) \equiv w(x,y) f(y), \tag{4.4}$$

of a linear Fredholm equation, where $w$ is nonnegative, bounded, and peaked along the
line $x = y$. The kernel (4.4) is linear in $f$ and, as discussed in Section 3.2.3, its derivative

$$k'(x,y,f) = w(x,y) \tag{4.5}$$

is proportional to the joint density of two random variables $X$ and $Y$. To the extent that
$w$ is peaked along $x = y$, we can say that $Y \approx X$. The definition of a peaked kernel
attempts to generalize this idea to kernels with derivatives with respect to $f$ which can
be regarded as bivariate probability densities for which $Y \approx t(X)$, for some monotone
function $y = t(x)$ in the unit square. We will consider a simple example for which $t(x)$ is
not the identity in Section 4.3.2.

Note  the  slight  difference  in  terminology  between  Chapter  3  and  Chapter  4.  If  we 
are  restricting  attention  to  linear  equations,  it  is  natural  to  refer  to  w(x,y)  as  the  kernel, 
since  the  relationship  between  w(x,y)  and  k(x,y,f)  is  the  same  for  any  linear  problem. 
However,  when  we  regard  a  linear  equation  as  merely  a  special  case  in  a  class  of  nonlinear 
equations, then we will refer to $k(x,y,f)$ as the kernel, where $k'(x,y,f) = w(x,y)$.


4.2  Newton’s  Method 

When  solving  a  system  of  nonlinear  equations  the  method  of  choice  is  often  Newton’s 
method.  Newton’s  method  is  known  to  converge  for  a  ‘good  enough’  starting  value  and 
to  converge  quadratically  in  most  cases.  The  sufficient  conditions  for  convergence  and 
the  rate  of  convergence  of  Newton’s  method  are  provided  by  the  Newton-Kantorovich 
theorem (e.g., Ortega, 1972), and this theorem is proved in Banach space. In particular,
Newton’s method is a useful algorithm for nonlinear integral equations in $L_2$.

4.2.1  The  Frechet  Derivative 

In  order  to  extend  the  definition  of  Newton’s  method  to  functional  (in  particular,  integral) 
equations,  a  concept  of  functional  derivative  is  necessary.  We  introduce  here  one  such 
derivative,  the  Frechet  derivative.  We  give  here  the  definition,  in  Banach  space,  following 
Debnath and Mikusiński (1990, p. 416).

Definition 4.2.1 (Frechet Derivative) Let $B_1$ and $B_2$ be Banach spaces, and let $x \in B_1$
be fixed. A continuous linear operator $A : B_1 \to B_2$ is called the Frechet derivative
of an operator $T : B_1 \to B_2$ at $x$ if

$$T(x + h) - T(x) = Ah + \Phi(x,h),$$

and

$$\lim_{\|h\| \to 0} \frac{\|\Phi(x,h)\|}{\|h\|} = 0.$$

The Frechet derivative at $x$ of $T$ will be denoted $T'(x)$.

It can be easily shown that if the Frechet derivative exists, then it is unique (Debnath
and Mikusiński, 1990, p. 417).

Define $T : L_2 \to L_2$ by

$$T(f) = \int_0^1 k(x,y,f)\,dy - g. \tag{4.6}$$

The Frechet derivative of $T$ is

$$T'(f)h = \int_0^1 k'(x,y,f) h(y)\,dy. \tag{4.7}$$


Now  that  we’ve  extended  the  notion  of  derivative  to  integral  operators  of  the  form  (4.1), 
we  can  state  what  Newton’s  method  is  for  this  equation. 


4.2.2  The  Newton-Step  Equation 

Let $T$ be an operator between Hilbert spaces, and let $f^n$ be a point in the domain at
which $T$ is Frechet differentiable. We would like to approximately determine a function
$f$ such that $T(f) = 0$ (where here ‘0’ denotes the function which is identically zero), and
we assume that $f^n$ is ‘near’ $f$. Expand $T$ about $f^n$ to first order in a Taylor series, giving

$$T(f) - T(f^n) \approx T'(f^n)(f - f^n). \tag{4.8}$$

But $T(f) = 0$, so we have the approximate linear equation

$$T(f^n) \approx -T'(f^n)(f - f^n) \tag{4.9}$$

relating $f^n$ to $f$. Let $f^{n+1}$ be a value of $f$ which makes (4.9) an equality. Solving the
linear operator equation (4.9) for $f^{n+1}$ constitutes one step of Newton’s method.

For (4.1), let

$$h^n \equiv f^{n+1} - f^n. \tag{4.10}$$

The function $h^n$ is a solution to the following Newton-step equation:

$$g - \int_0^1 k(x,y,f^n)\,dy = \int_0^1 k'(x,y,f^n) h^n\,dy. \tag{4.11}$$

Solving  the  linear  integral  equation  (4.11)  for  hn  is  in  general  difficult.  We  will  instead 
investigate  a  quasi-Newton  iterative  algorithm,  in  which  we  use  one  or  several  steps  of  the 
Conditional  Expectation  algorithm  of  Section  3.2  as  an  easily  determined  approximate 
Newton-step.  By  doing  this,  we  replace  a  quadratically  convergent  algorithm  with  a 
linearly  convergent  algorithm,  but  since  the  steps  of  the  quasi-Newton  algorithm  are,  by 
design,  very  easy  to  calculate,  we  are  often  better  off  using  the  linearly  convergent  method. 

There is another reason to want to use an approximate Newton-step. Recall from the
discussion of Section 2.4.2, that the exact solution of any numerical representation of an
ill-posed integral equation of the first kind is likely to be either nonexistent, or else quite
different from any solution to the integral equation. For an ill-posed nonlinear problem,
Newton’s method involves the exact solution of an ill-posed linear equation at each step.

4.3 The Conditional Expectation Algorithm

In  Section  3.2,  we  motivated  a  specific  preconditioned  Richardson  algorithm  for  a  linear 
integral  equation  of  the  first  kind.  We  now  propose  extending  this  algorithm  in  order  to 
iteratively  approximate  solutions  of  nonlinear  integral  equations.  Because  of  the  proba¬ 
bilistic  motivation  of  Section  3.2.3,  we  will  refer  to  this  algorithm,  whether  applied  to 
linear  or  to  nonlinear  equations,  as  the  Conditional  Expectation  algorithm.  To  illustrate 
this algorithm, we first consider the special case of (4.1) where $\phi(x,y) = y$, $k$ is peaked,
and $t(x) \approx x$. Following this, we suggest how the method can be extended to some more
general  problems. 


4.3.1  A  Simple  Case 
Let

$$\int_0^1 k[x, y, f(y)]\,dy = g(x), \tag{4.12}$$

where $k$ is a peaked kernel with $t(x) \approx x$. We propose attempting to solve (4.12) using a
nested iteration, in which the outer iteration is an approximate Newton method, with the
approximate Newton step provided by the inner iteration. Since the Newton step equation
is linear, we can use Richardson’s algorithm with stochastic preconditioning (Section 3.2)
in order to approximately determine the Newton steps. We call this nested algorithm the
Conditional Expectation algorithm, and we note that it reduces to the algorithm (of the
same name) discussed in Section 3.2 when (4.12) is linear in $f$.

Let the outer iteration be indexed by $n$, and let the inner iteration be indexed by $s$,
for $s = 1, \dots, \ell$. Actually, the inner iteration limit can depend on $n$, but we will not state
the algorithm in this much generality in order to keep the notation as simple as possible.
Write the Newton-step equation, at the $n$th outer iteration, as

$$r^n \equiv g - \int_0^1 k(x,y,f^n)\,dy = \int_0^1 k'(x,y,f^n) h^n(y)\,dy. \tag{4.13}$$

Approximate $h^n$ by $h^{n,\ell}$ where, since $t(x) \approx x$,

$$h^{n,s+1}(y) = h^{n,s}(y) + \delta^s(y) \approx h^{n,s}(y) + \delta^s(x), \tag{4.14}$$

$h^{n,0}$ is arbitrary, and

$$\delta^s(x) = \frac{r^n(x) - \int_0^1 k'(x,y,f^n) h^{n,s}(y)\,dy}{\int_0^1 k'(x,y,f^n)\,dy}. \tag{4.15}$$

A simple case, which is often useful, is to let $\ell = 1$, and to take $h^{n,0} = 0$ for all $n$. The
Conditional Expectation iteration is then

$$f^{n+1} = f^n + \frac{r^n}{\int_0^1 k'(x,y,f^n)\,dy}, \tag{4.16}$$

since

$$h^{n,\ell} = h^{n,1} = h^{n,0} + \delta^0 = \delta^0. \tag{4.17}$$

Another possibility, only slightly more complicated, is to let $\ell = 1$, $h^{0,0} = 0$, and

$$h^{n,0} = h^{n-1,1}, \tag{4.18}$$

the reasoning behind this being that if $h^{n+1} \approx h^n$, the $(n - 1)$st approximate Newton step
might provide a better initial approximation to the $n$th step than the zero vector. This
choice of an inner iteration leads to

$$f^{n+1} = f^n + \frac{r^n - \int_0^1 k'(x,y,f^n)[f^n - f^{n-1}]\,dy}{\int_0^1 k'(x,y,f^n)\,dy}. \tag{4.19}$$


The  iteration  (4.19)  makes  clear  the  relationship  of  the  nonlinear  algorithm  of  this  chapter 
to  the  linear  Conditional  Expectation  algorithm  of  Chapter  3. 
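
A discretized version of the nested iteration (4.13)-(4.15) might look as follows. This is a hedged sketch, not the author's implementation: it assumes that $x$ and $y$ share one quadrature grid, that $t(x) = x$ so the inner update can be applied pointwise, and that the kernel and its derivative are supplied as vectorized callables. The function name `ce_outer_step` and its arguments are hypothetical.

```python
import numpy as np


def ce_outer_step(f, g, k, k_prime, y, w, n_inner=1, h0=None):
    """One outer Conditional Expectation step for the discretized equation
    sum_j w[j] * k(x_i, y_j, f_j) = g_i, with x and y on the same grid.

    k and k_prime are callables (X, Y, F) -> array; h0 is the inner starting step.
    Returns the new iterate and the approximate Newton step after n_inner updates.
    """
    X, Y = np.meshgrid(y, y, indexing="ij")
    F = np.broadcast_to(f, X.shape)            # F[i, j] = f(y_j)
    r = g - (k(X, Y, F) * w).sum(axis=1)       # residual r^n, as in (4.13)
    Kp = k_prime(X, Y, F) * w                  # discretized k'(x, y, f^n)
    row_sums = Kp.sum(axis=1)                  # discretized integral of k' over y
    h = np.zeros_like(f) if h0 is None else h0.copy()
    for _ in range(n_inner):
        delta = (r - Kp @ h) / row_sums        # inner correction, as in (4.15)
        h = h + delta                          # inner update, as in (4.14)
    return f + h, h
```

With `n_inner=1` and `h0=None` this reduces to the simple iteration (4.16); passing the previous step as `h0` corresponds to the warm start (4.18). For a kernel that is linear in $f$, `n_inner` inner updates from a zero start coincide with `n_inner` steps of the linear algorithm of Chapter 3.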

If  the  integral  equation  of  interest  is  ill-posed,  then  Newton’s  method  will  almost 
certainly  either  diverge,  or  else  converge  to  a  solution  of  the  discretized  equation  that  is 
far  from  any  solution  to  the  original  equation.  Because  of  this,  it  is  probably  a  good 
idea to keep $\ell$ small; one exception being when the derivative of the kernel is expensive
to  compute.  Newton’s  method  has  been  used  to  motivate  the  iterative  algorithm  of  this 
chapter,  which  is  an  iteration  in  its  own  right.  Hence,  one  should  not  regard  the  closeness 
with  which  one  can  approximate  Newton  steps  as  an  overriding  consideration  in  using  the 
Conditional  Expectation  algorithm. 


4.3.2 An Example for Which $t(x) \ne x$ and $\phi(x,y) \ne y$

Consider the integral equation

$$\int_0^1 k_\alpha(x,y) f(xy)\,dy = g(x), \tag{4.20}$$

where

$$k_\alpha(x,y) = \begin{cases} y(1 - x^\alpha) & \text{if } y < x^\alpha \\ x^\alpha(1 - y) & \text{if } y > x^\alpha, \end{cases} \tag{4.21}$$

and $\alpha \ne 1$ is a parameter. For this example $\phi(x,y) = xy$ and $t(x) = x^\alpha$. The Newton-step
equation is

$$g(x) - \int_0^1 k_\alpha(x,y) f^n(xy)\,dy = \int_0^1 k_\alpha(x,y) h^n(xy)\,dy. \tag{4.22}$$

We approximate the unknown $h^n(xy)$ by a function which is constant in $y$ by replacing $y$
with

$$y_m = t(x) = x^\alpha, \tag{4.23}$$

which is the location of the peak in $y$ for each $x$. This leads to the Conditional Expectation
algorithm step

$$h^n(x^{\alpha+1}) = \frac{g(x) - \int_0^1 k_\alpha(x,y) f^n(xy)\,dy}{\int_0^1 k_\alpha(x,y)\,dy}, \tag{4.24}$$

or alternatively,

$$h^n(x) = \frac{g(x^{1/(\alpha+1)}) - \int_0^1 k_{\alpha/(\alpha+1)}(x,y) f^n(x^{1/(\alpha+1)} y)\,dy}{\int_0^1 k_{\alpha/(\alpha+1)}(x,y)\,dy}. \tag{4.25}$$

4.3.3 The General Case


The  example  of  the  previous  subsection  motivates  the  following  generalization  of  the 
Conditional  Expectation  algorithm.  For  the  general  case,  we  must  approximately  solve 
(4.11)  at  each  iteration,  and  we  rewrite  this  equation  as 

$$r^n(x) = \int_0^1 k'\{x, y, f^n[\phi(x,y)]\}\, h^n[\phi(x,y)]\,dy. \tag{4.26}$$

Assume that $k'(x,y,f^n)$ has a peak in $y$, for given $x$, with location $y_m = t(x)$. A reasonable
approximation to $h^n[\phi(x,y)]$ might be $h^n[\phi(x,u(x))]$, where $u(x) \approx t(x)$, and

$$h^n[\phi(x,u(x))] \equiv h^n(z) = \frac{r^n(x)}{\int_0^1 k'\{x, y, f^n[\phi(x,y)]\}\,dy}. \tag{4.27}$$

In order to provide a useful approximation, it seems reasonable to require that $u(x)$ have
the following two properties:

1. $k'\{x, u(x), f^n[\phi(x,u(x))]\}$ is ‘approximately’ equal to
$\max_{y \in [0,1]} k'\{x, y, f^n[\phi(x,y)]\}$ for all $x$.

2. $\phi[x, u(x)] : [0,1] \to [0,1]$ is monotone increasing, with $\phi(0, u(0)) = 0$ and $\phi(1, u(1)) = 1$.


If $t(x) = x$, then $u(x) = x$ exactly satisfies both of the above conditions. In general,
considerable  experimentation  may  be  required  in  order  to  determine  a  useful  function  u. 
In  Chapter  6,  we  discuss,  in  some  detail,  an  example  for  which 

#*•»> K  (rb)  (rb)  (428) 

and  t(x)  is  approximately  a  constant  function  over  most  of  the  range  of  x. 

Of the two conditions that we have imposed on $u$, the requirement that $\phi$ be a monotone
increasing function mapping the unit interval into itself is important; the other
condition is merely heuristic. There is no guarantee that the 'best' choice of $u$ exactly
maximizes $k'(x,u(x),f^n)$. A practical approach might be to first try simple choices of $u$,
however crude, and see what happens.


4.4  A  Simple  Numerical  Nonlinear  Example 


We  conclude  this  chapter  by  illustrating  the  Conditional  Expectation  algorithm  applied 
to  a  nonlinear  problem.  Let  v  be  a  given  differentiable  function  with  a  positive  derivative, 
and  consider  the  following  integral  equation: 


$$\int_0^1 w(x,y)\,v[f(y)]\,dy = g(x), \qquad (4.29)$$

for $w(x,y)$ given by the Green's function

$$w(x,y) = \begin{cases} y(1-x) & \text{if } y \le x \\ x(1-y) & \text{if } y > x, \end{cases} \qquad (4.30)$$

which we used in the examples of Section 3.4.1. The equation (4.29), though linear in
$v$, is nonlinear in the unknown function $f$, and of the form (4.1). The derivative of the
kernel at $f$ is

$$k'(x,y,f) = w(x,y)\,v'(f). \qquad (4.31)$$

The peak of $w(x,y)$ is at $y = x$; thus it is not unreasonable to assume that the peak of
(4.31) in $y$ for fixed $x$ is near the line $t(x) = x$. For the Conditional Expectation method,
we choose $l = 1$, and $h^{n,0} = 0$ for all $n$, so that the Conditional Expectation step is


$$h^{n,1}(x) = \frac{g(x) - \int_0^1 w(x,y)\,v[f^n(y)]\,dy}{\int_0^1 w(x,y)\,v'[f^n(y)]\,dy}, \qquad (4.32)$$

and

$$f^{n+1}(y) = f^n(y) + h^{n,1}(y). \qquad (4.33)$$

For a numerical example, we choose

$$v(x) = e^x, \qquad (4.34)$$

and

$$f(x) = 2x - 1, \qquad (4.35)$$

so that

$$g(x) = \frac{1 + x(e^2 - 1) - e^{2x}}{4e}. \qquad (4.36)$$


Note that there is a nontrivial distinction between the nonlinear iteration with $v(f) = e^f$,
and the linear case with $v(f) = f$. For this example, with $v(f) = e^f$, the quantity $v(f)$ and
all of its approximations $v(f^n)$ must be everywhere positive.

We take $f^0 = 1$ as a starting value, and discretize as in Appendix A, using 50-point
Gauss-Legendre quadrature. The first 50 approximations to the solution are displayed in
Figure 4.1. Initially, the algorithm converges rapidly, although eventually the convergence
rate becomes very slow. From the plot, in Figure 4.2, of the $L_2$ distance from the solution
as a function of the iteration index, it is clear that the rate of convergence begins to
decrease substantially after only a few iterations. However, convergence is sufficiently
rapid initially that approximations are near the solution before the iteration becomes
slowly convergent.
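The iteration (4.32)-(4.33) for this example can be sketched as follows. This is only an illustrative Python sketch, not the discretization of Appendix A: the function names (`gauss_legendre`, `conditional_expectation`) and the way the quadrature nodes are generated are our own assumptions, and the computations reported here were not carried out with this code.

```python
import math

def gauss_legendre(n):
    """Gauss-Legendre nodes and weights on [0, 1] (Newton iteration on P_n)."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        t = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess for a root
        for _ in range(100):
            p0, p1 = 1.0, t
            for k in range(2, n + 1):  # three-term recurrence up to P_n(t)
                p0, p1 = p1, ((2 * k - 1) * t * p1 - (k - 1) * p0) / k
            dp = n * (t * p1 - p0) / (t * t - 1.0)  # derivative P_n'(t)
            dt = p1 / dp
            t -= dt
            if abs(dt) < 1e-14:
                break
        nodes.append(0.5 * (t + 1.0))
        weights.append(1.0 / ((1.0 - t * t) * dp * dp))
    return nodes, weights

def green(x, y):
    """Green's function w(x, y) of (4.30)."""
    return y * (1.0 - x) if y <= x else x * (1.0 - y)

def g_rhs(x):
    """Right-hand side g(x) of (4.36)."""
    return (1.0 + x * (math.e ** 2 - 1.0) - math.exp(2.0 * x)) / (4.0 * math.e)

def conditional_expectation(n_nodes=50, n_iter=50):
    """Iterate (4.32)-(4.33) with v(f) = exp(f), starting from f^0 = 1."""
    x, wq = gauss_legendre(n_nodes)
    kern = [[green(xi, yj) * wj for yj, wj in zip(x, wq)] for xi in x]
    g = [g_rhs(xi) for xi in x]
    f_true = [2.0 * xi - 1.0 for xi in x]

    def l2_error(f):
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(wq, f, f_true)))

    f = [1.0] * n_nodes
    errors = [l2_error(f)]
    for _ in range(n_iter):
        v = [math.exp(fi) for fi in f]    # v[f^n(y)]
        dv = [math.exp(fi) for fi in f]   # v'[f^n(y)], equal to v for the exponential
        num = [gi - sum(k * vj for k, vj in zip(row, v)) for gi, row in zip(g, kern)]
        den = [sum(k * dj for k, dj in zip(row, dv)) for row in kern]
        f = [fi + ni / di for fi, ni, di in zip(f, num, den)]
        errors.append(l2_error(f))
    return x, f, errors
```

Calling `conditional_expectation()` returns the quadrature nodes, the final iterate at those nodes, and the $L_2$ distance from $f(x) = 2x - 1$ after each step, which is the quantity plotted in Figure 4.2.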
Figure 4.1: Fifty Iterations of the Conditional Expectation Algorithm for a Nonlinear Problem

Figure 4.2: Convergence Rate of the Conditional Expectation Algorithm for a Nonlinear Problem (axes: iteration; $L_2$ distance $\|f - f^n\|$)
Chapter  5

The  Behrens-Fisher  Problem 


5.1  Historical  Background 

The Behrens-Fisher problem is the problem of comparing the means of two normal populations
with no assumptions about the variances. This problem has received much attention,
and caused much controversy, because it is the simplest example of any practical
importance where the fiducial (and noninformative-prior Bayesian) and Neyman-Pearson
approaches arrive at substantially different answers.

Let $X_i$, $i = 1,\ldots,n_1$ be a random sample from a $N(\mu_1,\sigma_1^2)$ population, and let $Y_i$, $i = 1,\ldots,n_2$
be a random sample from a $N(\mu_2,\sigma_2^2)$ population, where $\mu_1$, $\mu_2$, $\sigma_1^2$, and $\sigma_2^2$ are
unknown. Let $\bar X_j$ and $S_j^2$, for $j = 1,2$, be the usual sample estimates of the means and
variances (defined in Section 2.6). We wish to test the composite hypothesis

$$H_0 : \mu_1 = \mu_2 \qquad (5.1)$$

against the alternative

$$H_1 : \mu_1 > \mu_2, \qquad (5.2)$$

where $\sigma_1^2/\sigma_2^2$ or, equivalently,

$$\theta = \frac{\sigma_1^2/n_1}{\sigma_1^2/n_1 + \sigma_2^2/n_2} \qquad (5.3)$$

is a nuisance parameter of special importance.

For both the fiducial and the frequentist approaches the test (of size $\alpha$) is of the form
'Reject $H_0$ if $U$ exceeds a critical value $c_\alpha(S_1^2,S_2^2)$', where

$$U = \frac{\bar X_1 - \bar X_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}; \qquad (5.4)$$

however, the critical values for the tests differ for the two approaches.

A fiducial argument suggests determining critical values for the test from the inverse
of the distribution of a certain linear combination of two Student $t$ random variables.
The distribution of this linear combination is known as the Behrens-Fisher distribution.
A Bayesian analysis with a noninformative prior leads to the same result. The Behrens-Fisher
distribution can be evaluated numerically to provide a test which has size $\alpha$, from
the standpoint of fiducial probability, for all $\theta$; but this test is not accepted in the
Neyman-Pearson framework (e.g., Wallace, 1980).

In the Neyman-Pearson framework, a test of

$$H_0 : \omega \in \Omega_0$$

which achieves a nominal size $\alpha$ for all $\omega \in \Omega_0$ is said to be similar. A similar test for
the Behrens-Fisher problem must achieve a fixed size $\alpha$ for all values of the nuisance
parameter $\theta$, and such a test does not exist. To be precise, a critical value statistic which
results in a similar test does not exist if $n_1$ and $n_2$ are of the same parity, and any critical
value statistic for sample sizes of opposite parity must be a function with infinitely many
discontinuities. This result was proved by Linnik and others in the 1960s (see Pfanzagl,
1974), and it was suspected to be true by many for years before. However, by the time
the Linnik results became available, much progress had been made toward a practical
solution from a frequentist perspective (e.g., Kendall and Stuart, 1977, Vol. 2, Chapter
21).

5.2  The  Trickett-Welch  Approach

Welch (1947) and Aspin (1948) tacitly assume the existence of a continuous critical value
statistic $v_\alpha(R)$ such that

$$P(U < v_\alpha(R)) = 1 - \alpha \qquad (5.5)$$

for all $\theta$, where

$$R = \frac{S_1^2/n_1}{S_1^2/n_1 + S_2^2/n_2} \qquad (5.6)$$

is a sample estimate of $\theta$, and the probability is determined assuming that the null
hypothesis is true. They proceed to calculate an asymptotic series for $v_\alpha$ including terms
of $O(1/(n_i - 1)^4)$. This series provides a test which is very nearly similar for all but very
small sample sizes. Of course, in the light of Linnik's results, it should not be surprising
that no bound was given by Welch and Aspin for the distance between their series
approximation and a solution to (5.5).

For $n_1$ and $n_2$ less than about seven, this asymptotic series is not adequate, and so
Trickett and Welch (1954) consider an alternative numerical approach. Equation (5.5)
can be written as an integral equation, and Trickett and Welch do so by conditioning on
the variance estimates and averaging over their distributions. We will derive this integral
equation next, using a somewhat simpler approach.

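5.2.1  The  Trickett-Welch  Equation

The sample means can be written as

$$\bar X_j = \mu_j + Z_j\sigma_j/\sqrt{n_j}, \qquad (5.7)$$

where the $Z_j$ are iid $N(0,1)$, and $j = 1,2$. So,

$$\bar X_1 - \bar X_2 = (\mu_1 - \mu_2) + Z_3\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}, \qquad (5.8)$$

for $Z_3 \sim N(0,1)$, and

$$Z = \frac{\bar X_1 - \bar X_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \sim N(\delta,1), \qquad (5.9)$$

with

$$\delta = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}, \qquad (5.10)$$

would be a natural test statistic to use if we knew the variances $\sigma_1^2$ and $\sigma_2^2$. Not knowing
these, we use the estimate

$$U = \frac{\bar X_1 - \bar X_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}, \qquad (5.11)$$

where

$$V_1 = \frac{\nu_1 S_1^2}{\sigma_1^2} \sim \chi^2_{\nu_1} \qquad (5.12)$$

and

$$V_2 = \frac{\nu_2 S_2^2}{\sigma_2^2} \sim \chi^2_{\nu_2}, \qquad (5.13)$$

with $\nu_j = n_j - 1$, are independent of each other and of $Z$. Let

$$W = \frac{\nu_1 S_1^2}{\sigma_1^2} + \frac{\nu_2 S_2^2}{\sigma_2^2} = V_1 + V_2, \qquad (5.14)$$

and

$$Y = \frac{\nu_1 S_1^2/\sigma_1^2}{W} = \frac{V_1}{W},$$

and note that

$$U = \frac{(\bar X_1 - \bar X_2)\big/\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}{\sqrt{\sigma_1^2 V_1/(n_1\nu_1) + \sigma_2^2 V_2/(n_2\nu_2)}\big/\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

$$= Z\left[\frac{\theta V_1}{\nu_1} + \frac{(1-\theta)V_2}{\nu_2}\right]^{-1/2} \qquad (5.15)$$

$$= Z\left\{W\left[\frac{\theta Y}{\nu_1} + \frac{(1-\theta)(1-Y)}{\nu_2}\right]\right\}^{-1/2}$$

$$= \frac{Z}{\sqrt{W/(\nu_1+\nu_2)}}\left\{(\nu_1+\nu_2)\left[\frac{Y\theta}{\nu_1} + \frac{(1-Y)(1-\theta)}{\nu_2}\right]\right\}^{-1/2}, \qquad (5.16)$$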


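where $Z$, $V_1$, and $V_2$ are independent random variables whose distributions, under $H_0$,
do not depend on the parameters.

It is well known, and easy to show, that

$$W \sim \chi^2_{\nu_1+\nu_2} \qquad (5.17)$$

is independent of

$$Y \sim \mathrm{Beta}(\nu_1/2,\nu_2/2). \qquad (5.18)$$

We can see from (5.9) and (5.16) that under $H_0$ the distribution of $U$ depends on the
parameters $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$ only through the nuisance parameter $\theta$, so it is natural to
use a test of the form: 'Reject $H_0$ if $U > v_\alpha(R)$', where

$$R = \frac{S_1^2/n_1}{S_1^2/n_1 + S_2^2/n_2} = \frac{Y\theta/\nu_1}{Y\theta/\nu_1 + (1-Y)(1-\theta)/\nu_2} = \phi(\theta,Y) \qquad (5.19)$$

is an estimator of $\theta$. If $H_0$ is true, then

$$T = \frac{Z}{\sqrt{W/(\nu_1+\nu_2)}} \sim t_{\nu_1+\nu_2}; \qquad (5.20)$$

that is, $T$ has a $t$ distribution with $\nu_1 + \nu_2$ degrees of freedom. Hence the events

$$\{U < v_\alpha(R)\} \qquad (5.21)$$

and

$$\left\{T < v_\alpha(R)\sqrt{(\nu_1+\nu_2)\left[\frac{Y\theta}{\nu_1} + \frac{(1-Y)(1-\theta)}{\nu_2}\right]}\right\} \qquad (5.22)$$

are equivalent.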

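Therefore we want to find $v_\alpha(R)$ so that, for all $\theta$,

$$P_\theta[U < v_\alpha(R)] = E\left\{T_{\nu_1+\nu_2}\left(v_\alpha(R)\sqrt{(\nu_1+\nu_2)\left[\frac{Y\theta}{\nu_1} + \frac{(1-Y)(1-\theta)}{\nu_2}\right]}\right)\right\} = 1 - \alpha, \qquad (5.23)$$

where $E\{\cdot\}$ denotes the expectation, under the null hypothesis, of an expression depending
only on one random variable $Y$, which has a Beta distribution, and where $T_{\nu_1+\nu_2}$ is the
cumulative of the $t$ distribution with $\nu_1 + \nu_2$ degrees of freedom. Thus our expectation
can be represented as an integral of the form

$$\int_0^1 k\{\theta,y,v_\alpha[\phi(\theta,y)]\}\,dy. \qquad (5.24)$$

We write the Trickett-Welch equation in full as

$$\int_0^1 k\{\theta,y,v_\alpha[\phi(\theta,y)]\}\,dy = g(\theta) = 1 - \alpha, \qquad (5.25)$$

where

$$\phi(\theta,y) = \frac{y\theta/\nu_1}{y\theta/\nu_1 + (1-y)(1-\theta)/\nu_2}, \qquad (5.26)$$

$$k(\theta,y,v_\alpha) = \mathrm{Beta}(y;\nu_1/2,\nu_2/2)\cdot T_{\nu_1+\nu_2}\left\{v_\alpha[\phi(\theta,y)]\sqrt{(\nu_1+\nu_2)\left[\frac{y\theta}{\nu_1} + \frac{(1-y)(1-\theta)}{\nu_2}\right]}\right\}, \qquad (5.27)$$

$$\mathrm{Beta}(y;\nu_1/2,\nu_2/2) = \frac{\Gamma[(\nu_1+\nu_2)/2]}{\Gamma(\nu_1/2)\Gamma(\nu_2/2)}\,y^{\nu_1/2-1}(1-y)^{\nu_2/2-1}, \qquad (5.28)$$

and

$$T_\nu(t) \equiv \int_{-\infty}^{t}\frac{\Gamma[(\nu+1)/2]}{\Gamma(\nu/2)\sqrt{\nu\pi}}\,(1 + x^2/\nu)^{-(\nu+1)/2}\,dx. \qquad (5.29)$$

Note that $v_\alpha(r)$ is the same function of the deterministic argument $r$ that $v_\alpha(R)$ is of the
random variable $R$.

We will sometimes use functional notation and write (5.23) as

$$F(v_\alpha) = 1 - \alpha. \qquad (5.30)$$

This is equivalent to the integral equation (5.25), which Trickett and Welch solve numerically.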
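As a check on (5.25)-(5.29), the left-hand side $F(v_\alpha)$ can be evaluated numerically for any trial critical function. The following Python sketch does so by Gauss-Legendre quadrature in $y$. All function names are our own, the 60-point rule is an arbitrary choice, and the $t$ cumulative is obtained by integrating the density in (5.29) rather than from a library routine, so this should be read as an illustration under those assumptions and not as the code of Appendix C.

```python
import math

def gauss_legendre(n):
    """Gauss-Legendre nodes and weights on [0, 1] (Newton iteration on P_n)."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        t = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            p0, p1 = 1.0, t
            for k in range(2, n + 1):
                p0, p1 = p1, ((2 * k - 1) * t * p1 - (k - 1) * p0) / k
            dp = n * (t * p1 - p0) / (t * t - 1.0)
            dt = p1 / dp
            t -= dt
            if abs(dt) < 1e-14:
                break
        nodes.append(0.5 * (t + 1.0))
        weights.append(1.0 / ((1.0 - t * t) * dp * dp))
    return nodes, weights

NODES, WEIGHTS = gauss_legendre(60)

def t_cdf(x, nu):
    """Student t cumulative (5.29): 0.5 plus the density integrated from 0 to x."""
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)
    s = sum(w * (1 + (x * u) ** 2 / nu) ** (-(nu + 1) / 2) for u, w in zip(NODES, WEIGHTS))
    return 0.5 + c * x * s

def beta_pdf(y, a, b):
    """Beta density (5.28) with shape parameters a and b."""
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))

def phi(theta, y, nu1, nu2):
    """The map (5.26) from (theta, y) to the value of R."""
    num = y * theta / nu1
    return num / (num + (1 - y) * (1 - theta) / nu2)

def F(v, theta, n1, n2):
    """Left-hand side of (5.25) for a critical function v(r) at nuisance value theta."""
    nu1, nu2 = n1 - 1, n2 - 1
    total = 0.0
    for y, w in zip(NODES, WEIGHTS):
        scale = math.sqrt((nu1 + nu2) * (y * theta / nu1 + (1 - y) * (1 - theta) / nu2))
        arg = v(phi(theta, y, nu1, nu2)) * scale
        total += w * beta_pdf(y, nu1 / 2, nu2 / 2) * t_cdf(arg, nu1 + nu2)
    return total
```

For example, `1 - F(lambda r: 1.65, 0.5, 20, 10)` gives the actual size at $\theta = 0.5$ of the test that uses the constant critical value 1.65. At $\theta = 1$ the statistic $U$ reduces to a $t$ variable with $\nu_1$ degrees of freedom, so for a constant critical value $c$ the result should agree with `t_cdf(c, n1 - 1)`, which provides a simple consistency check.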
5.3  Quasi-Newton  Methods  and  the  Trickett-Welch  Algorithm

Trickett and Welch approximate a solution to (5.23) by using a quasi-Newton iterative
algorithm. In Chapter 4, we introduced iterative algorithms for nonlinear integral equations
in general. In this section, we discuss, in the context of the Behrens-Fisher problem,
the quasi-Newton algorithm which Trickett and Welch used, as well as a Conditional
Expectation algorithm.


5.3.1  Newton’s  Method 

We will begin examining the application of iterative algorithms to (5.23) by considering
what Newton's method is for this problem. Assume that $v_\alpha$ solves (5.23), and expand to
first order about an approximate solution $v_\alpha^0$. Using functional notation, we have

$$F(v_\alpha) = 1 - \alpha \approx F(v_\alpha^0) + F'(v_\alpha^0)(v_\alpha - v_\alpha^0), \qquad (5.31)$$

where $F'(v_\alpha^0)$ is the Fréchet derivative of $F$ evaluated at $v_\alpha^0$.

If we regard (5.31) as an equality, and solve

$$F'(v_\alpha^0)h = F'(v_\alpha^0)(v_\alpha - v_\alpha^0) = 1 - \alpha - F(v_\alpha^0) \qquad (5.32)$$

for $h$, then we will be able to take a Newton step. Equation (5.32) is equivalent to

$$E\left\{h(R)\sqrt{(\nu_1+\nu_2)\left[\frac{\theta Y}{\nu_1} + \frac{(1-\theta)(1-Y)}{\nu_2}\right]}\cdot T'_{\nu_1+\nu_2}\left[v_\alpha^0(R)\sqrt{(\nu_1+\nu_2)\left[\frac{\theta Y}{\nu_1} + \frac{(1-\theta)(1-Y)}{\nu_2}\right]}\right]\right\} = 1 - \alpha - F(v_\alpha^0), \qquad (5.33)$$

where $T'_\nu$ denotes the $t$ density with $\nu$ degrees of freedom.

Since Newton's method is quadratically convergent (when it does converge) in Banach
space, for $v_\alpha^0$ 'close enough' to a solution, one might think that Newton's method would
be a good choice for this problem. However, as discussed in Chapter 4, we must keep in
mind that, although (5.33) is a linear integral equation, it is ill-posed and difficult to solve.
Also, the Behrens-Fisher problem has either none or else only pathological solutions, so
we have no reason to expect that Newton's method will work well, even when applied to
a discretized problem. It turns out that more conservative, linearly convergent iterative
algorithms perform quite well for this problem.

5.3.2  Quasi-Newton  Procedures 

We will suggest two simple algorithms based on approximating the Newton-step equation
(5.33). The first approximation is used by Trickett and Welch and is adequate for
the Behrens-Fisher problem. The second approximation is a form of the Conditional
Expectation algorithm of Chapters 3 and 4. There are heuristic reasons (Section 3.2) to
suspect the Conditional Expectation algorithm to be an improvement over the original
Trickett-Welch procedure; however, for the Behrens-Fisher problem, there is virtually no
difference between results obtained using the two procedures. We have used the Conditional
Expectation algorithm in the calculations below, but the simpler Trickett-Welch
algorithm results in the same iterates to several significant figures.


The  Trickett-Welch  Algorithm

Figure 5.1 shows a contour plot of a typical kernel for the Newton step equation (5.33).
To be specific, we have taken $n_1 = 20$, $n_2 = 10$, $v_\alpha$ equal to the constant 1.65, and
$\alpha = .05$. We refer to these contours as typical since the shape of the kernel does not
depend strongly on $v_\alpha$. The effect of $n_1$ and $n_2$ on the kernel is primarily limited to the
location and sharpness of the peak; the contours remain nearly straight vertical lines over
a wide range of $n_1$ and $n_2$. Also, for most applications, it is sufficient to consider $\alpha$ in the
range $.01 \le \alpha \le .10$, and, over this range, the shape of the kernel remains qualitatively
similar.

For the example with $n_1 = 20$ and $n_2 = 10$, the variance of $Y$ is small, so the kernel
is sharply peaked in $y$ (and the location of this peak is almost independent of $\theta$). The
mean of $Y$, $\nu_1/(\nu_1 + \nu_2)$, is near the mode of the density of $Y$ and is indicated by the
broken line. Changes in $v_\alpha$ of the magnitude which occur in practice do not significantly
affect the conclusion that the kernel is generally sharply
peaked near $y = \nu_1/(\nu_1 + \nu_2)$. Trickett and Welch note this fact, and use it to motivate a
quasi-Newton procedure.

Since the kernel has a peak in $y$ which does not depend very much on $\theta$, Trickett and
Welch replace the argument of the expectation in (5.33) by the value that this function
assumes when $Y = \nu_1/(\nu_1 + \nu_2)$. If we let $y_* = \nu_1/(\nu_1 + \nu_2)$, then the Trickett-Welch
approximation is


$$h[\phi(\theta,Y)]\sqrt{(\nu_1+\nu_2)\left[\frac{\theta Y}{\nu_1} + \frac{(1-\theta)(1-Y)}{\nu_2}\right]}\cdot T'_{\nu_1+\nu_2}\left[v_\alpha^0[\phi(\theta,Y)]\sqrt{(\nu_1+\nu_2)\left[\frac{\theta Y}{\nu_1} + \frac{(1-\theta)(1-Y)}{\nu_2}\right]}\right]$$

$$\approx h[\phi(\theta,y_*)]\sqrt{(\nu_1+\nu_2)\left[\frac{\theta y_*}{\nu_1} + \frac{(1-\theta)(1-y_*)}{\nu_2}\right]}\cdot T'_{\nu_1+\nu_2}\left[v_\alpha^0[\phi(\theta,y_*)]\sqrt{(\nu_1+\nu_2)\left[\frac{\theta y_*}{\nu_1} + \frac{(1-\theta)(1-y_*)}{\nu_2}\right]}\right]$$

$$= h(\theta)\,T'_{\nu_1+\nu_2}[v_\alpha^0(\theta)], \qquad (5.34)$$

where we have used

$$\phi(\theta,y_*) = \theta, \qquad (5.35)$$

and

$$\sqrt{(\nu_1+\nu_2)\left[\frac{\theta y_*}{\nu_1} + \frac{(1-\theta)(1-y_*)}{\nu_2}\right]} = 1. \qquad (5.36)$$

Since $h(\theta)T'_{\nu_1+\nu_2}[v_\alpha^0(\theta)]$ does not depend on the Beta random variable $Y$, it can be
taken out of the expectation in (5.33). Equivalently, since $h(\theta)T'_{\nu_1+\nu_2}[v_\alpha^0(\theta)]$ does not
depend on the variable of integration $y$, it can be taken out of the integrand if (5.33) is
written explicitly as an integral equation. Having made this approximation in (5.32), we
solve for an $\tilde h(\theta)$ which approximates the Newton step $h(\theta)$:

$$h(\theta) \approx \tilde h(\theta) = \frac{1 - \alpha - F(v_\alpha^0)}{T'_{\nu_1+\nu_2}[v_\alpha^0(\theta)]}. \qquad (5.37)$$

Given $v_\alpha^0$, we can calculate the right hand side of (5.37) numerically for any values of $\theta$
that we choose, and thereby determine the approximate Newton step $\tilde h(\theta)$ at as many
points as we like. Since $\theta$ takes the role of a dummy variable in (5.37), by determining
$\tilde h(\theta)$ we also determine $\tilde h(r)$ for the same values of the independent variable as $\tilde h(\theta)$. We
let the next approximation to $v_\alpha$ be

$$v_\alpha^1(r) = v_\alpha^0(r) + \tilde h(r), \qquad (5.38)$$

where we use an interpolation rule in order to get functions for all $r \in [0,1]$.
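The quasi-Newton step (5.37) with the update (5.38) can be sketched in Python as follows. The sketch assumes an equally spaced grid of $\theta$ values, linear interpolation of the critical function between grid points, and a 40-point Gauss-Legendre rule in $y$; these choices, the starting value 1.65, and all function names are illustrative assumptions and not a transcription of the original computations.

```python
import math

def gauss_legendre(n):
    """Gauss-Legendre nodes and weights on [0, 1] (Newton iteration on P_n)."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        t = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            p0, p1 = 1.0, t
            for k in range(2, n + 1):
                p0, p1 = p1, ((2 * k - 1) * t * p1 - (k - 1) * p0) / k
            dp = n * (t * p1 - p0) / (t * t - 1.0)
            dt = p1 / dp
            t -= dt
            if abs(dt) < 1e-14:
                break
        nodes.append(0.5 * (t + 1.0))
        weights.append(1.0 / ((1.0 - t * t) * dp * dp))
    return nodes, weights

Y_NODES, Y_WEIGHTS = gauss_legendre(40)

def t_pdf(x, nu):
    """Student t density T'_nu(x)."""
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_cdf(x, nu):
    """Student t cumulative: 0.5 plus the density integrated from 0 to x."""
    return 0.5 + x * sum(w * t_pdf(x * u, nu) for u, w in zip(Y_NODES, Y_WEIGHTS))

def beta_pdf(y, a, b):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))

def interp(values, r):
    """Linear interpolation on the equally spaced grid 0, 1/m, ..., 1."""
    m = len(values) - 1
    i = min(int(r * m), m - 1)
    frac = r * m - i
    return (1 - frac) * values[i] + frac * values[i + 1]

def F(v, theta, nu1, nu2):
    """Left-hand side of (5.25) for a critical function tabulated on the grid."""
    total = 0.0
    for y, w in zip(Y_NODES, Y_WEIGHTS):
        a = y * theta / nu1
        b = (1 - y) * (1 - theta) / nu2
        arg = interp(v, a / (a + b)) * math.sqrt((nu1 + nu2) * (a + b))  # a/(a+b) = phi
        total += w * beta_pdf(y, nu1 / 2, nu2 / 2) * t_cdf(arg, nu1 + nu2)
    return total

def trickett_welch(n1=20, n2=10, alpha=0.05, m=24, n_iter=10, v0=1.65):
    """Iterate v <- v + h with h from (5.37); returns grid, v, and max |size error|."""
    nu1, nu2 = n1 - 1, n2 - 1
    thetas = [i / m for i in range(m + 1)]
    v = [v0] * (m + 1)
    resid = [1 - alpha - F(v, th, nu1, nu2) for th in thetas]
    history = [max(abs(r) for r in resid)]
    for _ in range(n_iter):
        v = [vi + ri / t_pdf(vi, nu1 + nu2) for vi, ri in zip(v, resid)]
        resid = [1 - alpha - F(v, th, nu1, nu2) for th in thetas]
        history.append(max(abs(r) for r in resid))
    return thetas, v, history
```

The returned `history` records the maximum absolute deviation of the actual size from the nominal size over the grid after each step. At $\theta = 0$ and $\theta = 1$ the equation decouples, so the iterates there should approach the ordinary $t$ critical values with $\nu_2$ and $\nu_1$ degrees of freedom, respectively.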


The  Conditional  Expectation  Algorithm

We now apply the Conditional Expectation algorithm in the form (4.16), that is, by taking
one inner iteration, and by using the zero function as the initial iterate for the inner
iteration. Using the notation of Chapter 4 for the kernel in the Newton step equation, we
write (5.33) as

$$1 - \alpha - F(v_\alpha^0) = \int_0^1 k'\{\theta,y,v_\alpha^0[\phi(\theta,y)]\}\,h[\phi(\theta,y)]\,dy. \qquad (5.39)$$

We know that $k'\{\theta,y,v_\alpha^0[\phi(\theta,y)]\}$ has a peak in $y$ at approximately $y_* = \nu_1/(\nu_1 + \nu_2)$ for
all $\theta$, so we approximate $h[\phi(\theta,y)]$ by

$$h[\phi(\theta,y_*)] = h(\theta), \qquad (5.40)$$

which leads to the Conditional Expectation method quasi-Newton step

$$h(\theta) \approx \tilde h(\theta) = \frac{1 - \alpha - F(v_\alpha^0)}{\int_0^1 k'\{\theta,y,v_\alpha^0[\phi(\theta,y)]\}\,dy}. \qquad (5.41)$$

In the next section, we illustrate the Conditional Expectation algorithm with quasi-Newton
step (5.41) by means of a numerical example.


5.4  A  Numerical  Example 

In Figure 5.2 we present the result of applying the Conditional Expectation algorithm for
the case of $n_1 = 20$ and $n_2 = 10$. We have chosen values of the nuisance parameter to be

$$\theta_i = \frac{i-1}{m} \quad \text{for } i = 1,\ldots,m+1, \qquad (5.42)$$

where $m = 24$. Integration is by 25-point Gauss-Legendre quadrature, and the function
is interpolated using a linear spline in order to evaluate $F(v_\alpha^0)$ numerically. Appendix
C consists of the function, written in the S programming language (Becker, Chambers
and Wilks, 1988), which was used to interactively perform the calculations.

The successive approximations $v_\alpha^n$ are displayed in Figure 5.2, and the successive
calculations of the actual size (as a function of $\theta$) for a nominal size of $\alpha = .05$ are presented
in Figure 5.3. The actual size is calculated at the 25 nuisance parameter values chosen
for the discretization. In practice, the true nuisance parameter will be between two of
the values used for the discretization, so the numerical demonstration of near similarity in
Figure 5.3 is a bit deceiving. However, when the actual size is evaluated for other nuisance
parameter values, the actual size is found to be still virtually equal to the nominal size.

We have added to Figures 5.2 and 5.3 the critical value and actual size from the
commonly used Welch's approximate $t$ method (Welch, 1937; Bickel and Doksum, 1977,
p. 219), that is

$$v_\alpha(R) = T^{-1}_{\nu(R)}(1 - \alpha), \qquad (5.43)$$

where $T^{-1}_\nu$ is the inverse of the $t$ cumulative, and the degrees of freedom $\nu$ is given by

$$\nu(R) = \left[\frac{R^2}{n_1 - 1} + \frac{(1-R)^2}{n_2 - 1}\right]^{-1}, \qquad (5.44)$$

and $R$ is the nuisance parameter estimate (5.6). Although the Conditional Expectation
results are outstanding, the simple approximate $t$ method also provides a nearly similar
test.
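The degrees of freedom (5.44) are simple to compute from the nuisance parameter estimate (5.6). A minimal Python sketch follows; the function names are ours, and the critical value (5.43) would in addition require an inverse $t$ cumulative, which is not included here.

```python
def nuisance_estimate(s1_sq, s2_sq, n1, n2):
    """R of (5.6), computed from the sample variances s1_sq and s2_sq."""
    a, b = s1_sq / n1, s2_sq / n2
    return a / (a + b)

def welch_df(r, n1, n2):
    """Approximate degrees of freedom nu(R) of (5.44)."""
    return 1.0 / (r ** 2 / (n1 - 1) + (1.0 - r) ** 2 / (n2 - 1))
```

The value of `welch_df` ranges from $n_2 - 1$ at $R = 0$ to $n_1 - 1$ at $R = 1$, and it attains its maximum $\nu_1 + \nu_2$ at $R = \nu_1/(\nu_1 + \nu_2)$, the same value at which the kernel of the Newton step equation is peaked in $y$.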

Welch's approximate $t$ is certainly easier to use than the method which results from
'solving' the Trickett-Welch integral equation, and for most applications the approximate
$t$ provides a test that is as near to being similar as is necessary. But the Conditional
Expectation method can serve another purpose, even if it is not used very often in practice.
Ad-hoc approximations such as Welch's $t$ are often compared on the basis of their
nearness to similarity, and also power (see, e.g., Best and Rayner, 1987). Measuring the
distances between an ad-hoc critical function and a critical function for an 'exactly' similar
test provides information on how close a proposed confidence interval comes to 'the
best possible' result.

No detail can be obtained from Figure 5.3 except for the first few iterations because
the convergence to similarity is so rapid. Also, it is difficult to infer much about the rate of
convergence from Figure 5.3. In Figure 5.4 we see the distance, in the $L_\infty$ norm (maximum
absolute deviation), from the nominal size in a semilog plot against the iteration number.

We can see from Figure 5.4 that the rate of convergence is rapid at first, and then eventually
decreases to the point where, after twenty or so steps, it hardly seems worthwhile
to continue. Intuitively, this is consistent with our previous discussion of the convergence
of Richardson's algorithm. After the components of the initial approximation in the
directions of the dominant eigenfunctions decay, the 'less important' eigenfunctions remain,
and these decay much more slowly since they correspond to smaller eigenvalues.

In summary, these results are quite spectacular. Although there is no exact solution
to the Behrens-Fisher problem, we are able to easily determine a smooth critical function
for which the distance of the right hand side from $1 - \alpha$ is less than $10^{-6}$!


5.5  The  Power  of  Tests  for  the  Behrens-Fisher  Problem 


If the null hypothesis is not true, then $P(U > v_\alpha(R)) = 1 - P(U < v_\alpha(R))$ is the power function for a test
using the critical value $v_\alpha(R)$. We can easily obtain an expression for the power, using
the same approach as in Section 5.2. Since the null hypothesis is not true, we obtain

$$T = \frac{Z + \delta}{\sqrt{W/(\nu_1+\nu_2)}} \sim T_{\nu_1+\nu_2}(\delta), \qquad (5.45)$$

where $Z$ is now standard normal and $T_{\nu_1+\nu_2}(\delta)$ denotes the noncentral $t$ cumulative with $\nu_1 + \nu_2$ degrees of freedom and
noncentrality parameter $\delta$ given by (5.10), instead of the expression

$$\frac{Z}{\sqrt{W/(\nu_1+\nu_2)}} \qquad (5.46)$$

which appears in Section 5.2. The power is

$$\pi(\delta)_\theta = 1 - E\left\{T_{\nu_1+\nu_2}(\delta)\left[v_\alpha(R)\sqrt{(\nu_1+\nu_2)\left[\frac{\theta Y}{\nu_1} + \frac{(1-Y)(1-\theta)}{\nu_2}\right]}\right]\right\}, \qquad (5.47)$$

where the notation $\pi(\delta)_\theta$ indicates that the power is a family of functions of $\delta$, indexed
by the nuisance parameter value $\theta$.

We calculate (5.47) numerically next as a continuation of the numerical example of
the previous section. As an example of a power calculation, we compare the Conditional
Expectation procedure with Welch's approximate $t$. It only makes sense to compare the
power of tests which have the same size, so we begin by examining Figure 5.2 in order
to determine a $\theta$ value for which the sizes of the two methods are nearly the same. We
thereby choose $\theta = .35$ for the value of the nuisance parameter, and calculate the size
of the Conditional Expectation method to be .05000073, and the size of Welch's $t$ to
be .049928. The power function for the Conditional Expectation method is displayed in
Figure 5.5. The power curve for Welch's $t$ is not graphed in Figure 5.5 since it would not
be discernible from the other power function. The difference in the two power functions
(times 1000) is displayed in Figure 5.6. Note that the maximum difference between the
powers is not much larger than the (very small) difference between the sizes. It seems
that the test derived by the Conditional Expectation algorithm achieves near similarity
without sacrificing power relative to Welch's approximate $t$ test.
Figure 5.2: The Trickett-Welch Approximations to the Critical Function

Figure 5.3: Size for Trickett-Welch Critical Value Function Approximations

Figure 5.4: Convergence to Similarity of the Trickett-Welch Iterates

Figure 5.5: Power Function for Trickett-Welch Test

Figure 5.6: Difference in Power Functions: T.W. vs. Welch t
Chapter  6

One-Sided Tolerance Limits for a One-Way Balanced Random-Effects ANOVA Model


6.1  Other  Applications  of  Iterative  Algorithms 

The Behrens-Fisher problem is only one example of a normal-theory problem with a
nuisance parameter. Other examples include confidence intervals for the common mean of
two normal populations and one-sided prediction intervals for a one-way balanced random
effects model. The (unmodified) Trickett-Welch algorithm has been applied successfully
to these problems by Marie and Graybill (1979) and Wang (1988), respectively. In fact,
Marie and Graybill apparently independently discovered the Trickett-Welch algorithm.

We will discuss another problem, one which has some importance for applications and
which was the starting point for this thesis. This problem concerns one-sided confidence
intervals for a quantile of a normal population with two components of variance estimated
using data from a one-way balanced random-effects ANOVA model.

These tolerance limits are important for characterizing the strength of composite materials
(Mil-HDBK-17C, 1992), and there was concern over the conservatism of an approximate
procedure for this problem due to Mee and Owen (1983). Attempts to reduce this
conservatism led to the application of the ideas first of Welch (1947) and later of Trickett
and Welch (1954). The integral equation for this tolerance limit problem is substantially
more complicated than the Trickett-Welch equation (5.25) of Chapter 5. Consequently
only first order terms of the Welch-Aspin type asymptotic expansion are tractable, and
the Trickett-Welch algorithm does not work at all. However, the Conditional Expectation
algorithm is very effective on this problem. In addition to this thesis, this work is reported
on in Vangel (1987, 1990, 1992).


6.2  The  Tolerance  Limit  Problem 

Let $X$ be a normally distributed random variable with mean $\mu$ and variance $\sigma_x^2 = \sigma_b^2 + \sigma_e^2$.
A lower confidence limit for a quantile of this population (i.e., a lower tolerance limit)
is to be determined using data from a one-way balanced random effects ANOVA sample
with between-group and within-group variances $\sigma_b^2$ and $\sigma_e^2$ respectively.

For example, let $X$ represent the strength of a randomly selected specimen of a material
manufactured in a batch which can be considered to be randomly selected from a
population of batches. A quantity of interest to aircraft designers is the 'B-basis value',
which is a 95 percent lower confidence limit on the tenth percentile of the distribution
of $X$. For this situation, it is important that nearly the nominal coverage probability be
attained whatever the unknown population variance ratio. It is also very desirable that
the calculated limit be as large as possible, since unnecessarily low values cause undue
conservatism in design.

We discuss below techniques for determining one-sided tolerance limits for $X$ based
on a random sample of $J$ items from each of $I$ batches. A $(\beta,\gamma)$ lower tolerance limit is a
statistic $T$ such that at least a proportion $\beta$ of the population is covered by the interval
$(T, \infty)$ with probability at least $\gamma$. The methods developed here for lower tolerance limits
can be adapted in an obvious way to upper limits. We will refer to $\beta$ as the coverage and
$\gamma$ as the confidence.

This problem was first considered by Lemon (1977) who proposed an approximate
solution too conservative for most applications. Mee and Owen (1983) greatly improved
on Lemon's results by using a Satterthwaite (1947) approximation. Seeger and Thorsson
(1972) proposed the same approximation for the corresponding two-sided problem. The
Mee-Owen method is reviewed in Vangel (1990) and will not be described here. Instead,
we will regard this problem as a typical normal-theory inverse problem, requiring the
solution of an integral equation, and apply the Conditional Expectation algorithm.

First  we  shall  consider  the  case  where  the  nuisance  parameters  are  known.  Then 
we  shall  develop  a  Welch-Aspin  type  of  expansion.  The  latter  can  serve  as  an  initial 
approximation  for  the  Conditional  Expectation  algorithm. 


6.3  The  One-Way  Balanced  Random-Effects  Model

Let $X_{ij}$ denote the $j$th of $J$ observations from the $i$th of $I$ batches. If $X_{ij}$ follows a
one-way balanced random-effects model, then

$$X_{ij} = \mu + b_i + e_{ij}, \qquad (6.1)$$

where $\mu$ denotes the population mean, $\mu + b_i$ denotes the mean of the $i$th batch, and
$e_{ij}$ is the error term. The $b_i$'s and the $e_{ij}$'s are assumed to be independently distributed
normal with mean zero and variance $\sigma_b^2$ and $\sigma_e^2$ respectively. An observation $X$ from this
population is thus normally distributed with mean $\mu$ and variance

$$\sigma_x^2 = \sigma_b^2 + \sigma_e^2. \qquad (6.2)$$

Let $n = IJ$ denote the sample size. The parameters $\mu$, $\sigma_e^2$, and $\sigma_b^2$ of the random effects
model can be estimated by the pooled mean $\hat\mu$, the within batch mean square $MS_e$, and
a linear combination of $MS_e$ with the between batch mean square $MS_b$, where:


\[ \hat\mu = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij}, \tag{6.3} \]

\[ \bar{X}_i = \frac{1}{J} \sum_{j=1}^{J} X_{ij}, \tag{6.4} \]

(6.5)


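By analogy with the single sample case (see, for example, Owen (1968)), we seek an
estimator of the form

\[ \hat\mu - k\hat\sigma_x, \tag{6.10} \]

where $k$ is chosen to satisfy, for all $\sigma_b^2$ and $\sigma_e^2$,

\[ P(\hat\mu - k\hat\sigma_x \le \mu - z_\beta\sigma_x) = \gamma. \tag{6.11} \]

Since $\hat\mu$ has a normal distribution with mean $\mu$ and variance

\[ \sigma_{\hat\mu}^2 = (J\sigma_b^2 + \sigma_e^2)/n, \tag{6.12} \]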

we  can  rewrite  (6.11)  as 

(6.13) 

where 

(6.14) 

(6.15) 

and

\[ r = \sigma_b^2/\sigma_e^2. \tag{6.16} \]

The random variable (6.14) has a $N(0,1)$ distribution, and is independent of (6.7), whose
component terms are independent.
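
The displayed forms of (6.13)-(6.15) are not legible in this copy. As a hedged sketch only, the following standardization is consistent with (6.11), (6.12), (6.16) and with the later use of $Z$, $b$ and $\delta = z_\beta b\sqrt{n}$; the symbols are matched to that later use, but the exact layout of the original equations is an assumption.

```latex
\begin{align*}
P(\hat\mu - k\hat\sigma_x \le \mu - z_\beta\sigma_x)
  &= P\!\left( \frac{\hat\mu - \mu}{\sigma_{\hat\mu}}
        + z_\beta \frac{\sigma_x}{\sigma_{\hat\mu}}
        \le k\,\frac{\hat\sigma_x}{\sigma_{\hat\mu}} \right) \\
  &= P\!\left( Z + \sqrt{n}\, z_\beta b
        \le k \sqrt{n}\, b\, \hat\sigma_x/\sigma_x \right),
\end{align*}
% assumed definitions, chosen to agree with delta = z_beta b sqrt(n):
%   Z = (\hat\mu - \mu)/\sigma_{\hat\mu} \sim N(0,1),
%   b = \sigma_x/(\sqrt{n}\,\sigma_{\hat\mu}) = \sqrt{(r+1)/(Jr+1)}.
```

The second line uses $\sigma_x/\sigma_{\hat\mu} = \sqrt{n}\,b$, which follows from (6.2), (6.12) and (6.16).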
6.4  An Exact Solution for Known r

For  a  simple  random  sample,  a  solution  to  the  one  sided  tolerance  limit  problem  is  readily 
obtained  in  terms  of  the  noncentral  t  distribution  (see,  e.g.,  Owen  1968).  If  one  assumes 
that  the  variance  ratio  r  is  known,  then  the  corresponding  problem  for  a  sample  from 
a  balanced  random  effects  model  can  be  solved  almost  as  easily.  What  is  required  is 
the  distribution  of  a  ‘generalized  noncentral  t'  random  variable,  a  generalization  of  the 
noncentral  t  to  a  random  variable  with  the  square  root  of  a  linear  combination  of  two 
$\chi^2$ random variables in the denominator. In this section, we derive this distribution, and
then  we  show  how  it  can  be  used  to  solve  the  tolerance  limit  problem  for  known  r. 

Let the random variables $Z$, $Y_1$, and $Y_2$ have the following distributions:

\[ Z \sim N(0,1), \tag{6.17} \]

and

\[ Y_j \sim \chi^2_{n_j} \tag{6.18} \]

for $j = 1,2$, where $Z$, $Y_1$, and $Y_2$ are mutually independent. We will call $A$ a generalized
noncentral t random variable if $A$ has the form

\[ A = (n_1 + n_2)^{1/2} \frac{Z + \delta}{\sqrt{d_1 Y_1 + d_2 Y_2}}, \tag{6.19} \]

where $d_1$, $d_2$, and $\delta$ are constants with $d_1$ and $d_2$ positive.

We  will  find  the  distribution  of  (6.19)  by  a  technique  very  similar  to  the  approach 
used  in  Section  5.2.1.  We  express  A  as  the  product  of  a  noncentral  t  random  variable 
times  an  expression  involving  only  known  constants  and  a  beta  random  variable.  Since 
the  two  terms  in  this  product  are  independent,  conditioning  on  the  beta  random  variable 
and  integrating  yields  the  distribution  of  A. 

The random variable $A$ is easily seen to be equal to

\[ A = \frac{Z + \delta}{\sqrt{(Y_1 + Y_2)/(n_1 + n_2)}}
       \left[ \frac{Y_1 + Y_2}{d_1 Y_1 + d_2 Y_2} \right]^{1/2}
     = T \left[ \frac{d_1 Y_1}{Y_1 + Y_2} + \frac{d_2 Y_2}{Y_1 + Y_2} \right]^{-1/2}
     = \frac{T}{\sqrt{d_1 Y + d_2 (1 - Y)}}, \tag{6.20} \]

where $T$ has the noncentral t distribution with degrees of freedom $n_1 + n_2$ and noncentrality
parameter $\delta$, denoted $T_{n_1+n_2}(\delta)$, and $Y = Y_1/(Y_1 + Y_2)$ has the beta distribution, with
parameters $n_1/2$ and $n_2/2$, denoted Beta$(n_1/2, n_2/2)$. It is well known (e.g., Fleiss, 1971) that $Y$ has
the claimed beta distribution, and that $L = Y_1 + Y_2$ is independent of $Y$. Since $Z$ is
independent of $Y_j$, hence of $Y$, by assumption, it follows that $T$ and $Y$ are independent.

By conditioning on $Y$, we see that

\[ F_A(t) = P(A \le t) = P\left[ T \le t\sqrt{d_1 Y + d_2(1 - Y)} \right]
          = E\left[ T_{n_1+n_2}\left( t\sqrt{d_1 Y + d_2(1 - Y)},\ \delta \right) \right], \tag{6.21} \]

where $T_f(t, \delta)$ denotes the noncentral t cumulative distribution with $f$ degrees of freedom
and noncentrality parameter $\delta$, that is

\[ T_f(t, \delta) = \int_0^\infty \Phi\left( t\sqrt{u/f} - \delta \right) c_f(u)\, du, \tag{6.22} \]

where $c_f$ denotes the $\chi^2$ density with $f$ degrees of freedom and $\Phi(\cdot)$ is the standard
normal distribution function. Thus, $F_A(t)$ can be expressed as an integral of a function of $y$, i.e.
the function $T$ in (6.21) times the beta density. $F_A$ is a distribution with argument $t$
and implicit parameters $n_1$, $n_2$, $d_1$, $d_2$, and $\delta$.

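For the tolerance limit problem, we have using (6.13) that

\[ P\left( \frac{Z + \sqrt{n}\, z_\beta b}{\hat\sigma_x/\sigma_x} \le \sqrt{n}\, b\, k \right)
   = P\left( \frac{Z + \sqrt{n}\, z_\beta b}{\sqrt{n}\, b\, \hat\sigma_x/\sigma_x} \le k \right) = \gamma. \tag{6.23} \]

Let $n_1 = I - 1$ and $n_2 = I(J - 1)$ be the between-group and within-group degrees of
freedom, respectively. Since the mean squares $MS_b$ and $MS_e$ are proportional to $\chi^2$ random
variables, we see that

\[ MS_b = (J\sigma_b^2 + \sigma_e^2) Y_1/n_1, \tag{6.24} \]

\[ MS_e = \sigma_e^2 Y_2/n_2, \tag{6.25} \]

and

\[ \hat\sigma_x^2 = (\sigma_b^2 + \sigma_e^2/J) Y_1/n_1 + (1 - 1/J)\sigma_e^2 Y_2/n_2. \tag{6.26} \]

Simple algebra now leads to

\[ n b^2 \hat\sigma_x^2/\sigma_x^2 = \frac{I\, Y_1}{I - 1} + \frac{Y_2}{Jr + 1}, \tag{6.27} \]

where $r$ is the variance ratio, $r = \sigma_b^2/\sigma_e^2$.

If we let

\[ d_1 = \frac{(n_1 + n_2) I}{I - 1} \tag{6.28} \]

and

\[ d_2 = \frac{n_1 + n_2}{Jr + 1}, \tag{6.29} \]

then, for these specific values of $n_1$, $n_2$, $d_1$, and $d_2$, with $r$ known, we have that

\[ P(\hat\mu - k\hat\sigma_x \le \mu - z_\beta\sigma_x) = F_A(k) \tag{6.30} \]

and

\[ \delta = z_\beta b \sqrt{n} = z_\beta \sqrt{\frac{n(r+1)}{Jr+1}}. \tag{6.31} \]

A constant or function, such as the constant $k$ in (6.30), which leads to a tolerance limit
is called a tolerance limit factor. The value $k(r)$ of $k$ such that $F_A(k) = \gamma$ thus provides
an exact solution to the problem for known variance ratio $r$, where $F_A(k)$ depends on $\sigma_b^2$
and $\sigma_e^2$ only through $r$.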
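
As an illustration of how the known-r factor could be computed, the sketch below evaluates $F_A$ by quadrature and solves $F_A(k) = \gamma$ by bisection. It assumes $d_1 = (n_1+n_2)I/(I-1)$, $d_2 = (n_1+n_2)/(Jr+1)$ and $\delta = z_\beta\sqrt{n(r+1)/(Jr+1)}$, and it uses the equivalent representation $F_A(t) = E[\Phi(t\sqrt{(d_1 Y + d_2(1-Y))L/(n_1+n_2)} - \delta)]$ with $L = Y_1 + Y_2$. The function names, the midpoint rule and the grid sizes are choices made for this example, not the author's implementation.

```python
import math
from statistics import NormalDist

def _midpoints(lo, hi, m):
    """Midpoint-rule nodes on (lo, hi); the endpoints are never evaluated."""
    h = (hi - lo) / m
    return [lo + (i + 0.5) * h for i in range(m)]

def _weights(log_density):
    """Turn unnormalised log-densities at the nodes into weights summing to 1."""
    top = max(log_density)
    w = [math.exp(v - top) for v in log_density]
    total = sum(w)
    return [v / total for v in w]

def _phi(x):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def make_gen_nct_cdf(n1, n2, d1, d2, delta, m_y=200, m_u=120):
    """Return t -> F_A(t) for the generalized noncentral t variable A."""
    f = n1 + n2
    # Y ~ Beta(n1/2, n2/2): nodes and self-normalised weights.
    ys = _midpoints(0.0, 1.0, m_y)
    wy = _weights([(n1 / 2 - 1) * math.log(y) + (n2 / 2 - 1) * math.log(1 - y)
                   for y in ys])
    # L = Y1 + Y2 ~ chi-square(f), truncated far in the upper tail.
    us = _midpoints(0.0, f + 12.0 * math.sqrt(2.0 * f), m_u)
    wu = _weights([(f / 2 - 1) * math.log(u) - u / 2 for u in us])
    scale = [math.sqrt(d1 * y + d2 * (1 - y)) for y in ys]
    root = [math.sqrt(u / f) for u in us]

    def cdf(t):
        return sum(w1 * sum(w2 * _phi(t * s * ru - delta)
                            for ru, w2 in zip(root, wu))
                   for s, w1 in zip(scale, wy))
    return cdf

def known_r_factor(I, J, r, beta, gamma):
    """Solve F_A(k) = gamma for a known variance ratio r (bisection)."""
    n1, n2, n = I - 1, I * (J - 1), I * J
    d1 = (n1 + n2) * I / (I - 1)
    d2 = (n1 + n2) / (J * r + 1)
    delta = NormalDist().inv_cdf(beta) * math.sqrt(n * (r + 1) / (J * r + 1))
    cdf = make_gen_nct_cdf(n1, n2, d1, d2, delta)
    lo, hi = 0.0, 1.0
    while cdf(hi) < gamma:          # expand until the root is bracketed
        hi *= 2.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For very large r the factor should approach the simple random sample factor for a sample of size I, which gives a convenient sanity check. The midpoint rule is crude when $n_1/2 < 1$ (that is, I = 2), where the beta density is unbounded at zero.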




Later we will consider the case where the constant $k$ is replaced by a function

\[ c = c(MS_b, MS_e). \]

In that case, $t$ in (6.21) can be replaced by $c$ as long as $c$ can be represented as a function
of the mean square ratio:

\[ Q = MS_b/MS_e. \tag{6.32} \]

The result is an integral equation depending on $\sigma_b^2$ and $\sigma_e^2$ only through $r$.


6.5  The Solution for Unknown r: A Welch-Aspin Type
Asymptotic Expansion

For unknown variance ratio, the tolerance limit problem is closely related to the Behrens-
Fisher problem. Since it is well known that there is no ‘well behaved’ solution to the
Behrens-Fisher problem, it is likely that a tolerance limit factor for which the corresponding
tolerance limit has exactly the nominal confidence for all $r$ does not exist. However,
we can proceed as if a tolerance limit factor does exist and attempt to approximate it.
Following the work of Welch (1947), Aspin (1948), and Trickett and Welch (1954), we will
propose three tolerance limit factors, which we will sometimes refer to as ‘solutions’.

The first solution discussed is based on an asymptotic expansion (for large $I$ and $J$) of
the type considered by Welch and Aspin; we will call this solution the Asymptotic Expansion
tolerance limit factor. While computationally simple, the first order approximation
presented here is anticonservative and may only be suitable for many batches.

We could improve this procedure by taking higher order approximations. However,
this becomes very tedious to carry out. Instead, we propose an ad-hoc modification to
the Asymptotic Expansion tolerance limit factor which is very easy to use and which is
adequate for most applications. We will refer to this result as the Modified Asymptotic
Expansion tolerance limit factor. In Section 6.6, the tolerance limit factor as a function of
the mean square ratio will be obtained approximately as a solution of an integral equation
by means of the Conditional Expectation algorithm. The Conditional Expectation tolerance
limit factor which results provides confidence extremely close to the nominal level for
all values of the nuisance parameter, even for very small sample sizes.

To simplify the notation in what follows, let $S_i^2$ be the mean squares, $\sigma_i^2$ their expected
values, and $n_i$ the associated degrees of freedom for $i = 1,2$, i.e.:

\[ S_1^2 = MS_b, \quad \sigma_1^2 = J\sigma_b^2 + \sigma_e^2, \quad n_1 = I - 1, \]

\[ S_2^2 = MS_e, \quad \sigma_2^2 = \sigma_e^2, \quad n_2 = I(J - 1). \]

The pooled sample size is $n = IJ$ and the population variance is denoted by

\[ \sigma^2 = \sigma_x^2 = \sigma_b^2 + \sigma_e^2 = \sigma_1^2/J + \sigma_2^2(1 - 1/J), \tag{6.33} \]

and estimated by

\[ S^2 = \hat\sigma_x^2 = S_1^2/J + S_2^2(1 - 1/J). \tag{6.34} \]

The subscript $x$ for the population variance and the estimate of this variance will be
omitted for the remainder of this section.
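
A minimal sketch of these summary statistics for balanced data follows. It assumes the usual one-way ANOVA definitions of the between and within batch mean squares, and takes the mean square ratio to be $Q = MS_b/MS_e$; the function name and the dictionary keys are hypothetical.

```python
def one_way_summary(data):
    """Summary statistics for balanced one-way data.

    `data` is a list of I batches, each a list of J observations.
    """
    I = len(data)
    J = len(data[0])
    if any(len(batch) != J for batch in data):
        raise ValueError("balanced data required: every batch needs J values")
    batch_means = [sum(batch) / J for batch in data]
    grand_mean = sum(batch_means) / I          # pooled mean of all IJ values
    # Between batch mean square, n1 = I - 1 degrees of freedom.
    ms_b = J * sum((m - grand_mean) ** 2 for m in batch_means) / (I - 1)
    # Within batch mean square, n2 = I(J - 1) degrees of freedom.
    ms_e = sum((x - m) ** 2
               for batch, m in zip(data, batch_means)
               for x in batch) / (I * (J - 1))
    # Estimate of the population variance: S^2 = S1^2/J + S2^2 (1 - 1/J).
    var_x = ms_b / J + (1 - 1 / J) * ms_e
    return {"mean": grand_mean, "ms_b": ms_b, "ms_e": ms_e,
            "var_x": var_x, "Q": ms_b / ms_e}
```

For example, `one_way_summary([[1, 2, 3], [2, 3, 4], [6, 7, 8]])` has batch means 2, 3 and 7, so the between batch mean square is 21 and the within batch mean square is 1.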

We will consider tolerance limits of the form $\hat\mu - k\hat\sigma_x$. If the variance ratio $r$ were
known, then the factor $k(r)$ determined from the generalized noncentral t distribution
in Section 6.4 would be appropriate. Since $r$ is not known, but can be estimated as a
function of $S_1^2$ and $S_2^2$, we will replace $k$ by $c(S_1^2, S_2^2)$. We will call $c$ the tolerance limit
factor, and we define $h(S_1^2, S_2^2)$ to be $cS$.

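The tolerance limit corresponding to this factor $c$ can be expressed as an expectation
with respect to the distributions of the mean squares in terms of the standard normal
distribution, so that (6.11) becomes

\[ \gamma = P(\hat\mu - c\hat\sigma \le \mu - z_\beta\sigma)
          = E\left[ \Phi\left( \frac{h(S_1^2, S_2^2)}{\sigma_1/\sqrt{n}} - \delta \right) \right], \tag{6.35} \]

where as above

\[ \delta = z_\beta \sqrt{\frac{n(r+1)}{Jr+1}} = \frac{z_\beta\sigma}{\sigma_1/\sqrt{n}}. \tag{6.36} \]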


The problem is to determine a function $h(S_1^2, S_2^2)$ so that (6.35) is approximately
satisfied for all $\sigma_1^2$ and $\sigma_2^2$. If tolerance limits on the median are desired, then $\delta = 0$
and the results of Welch (1947) and Aspin (1948) can be used directly. If $\delta$ is not zero,
the idea behind the Welch-Aspin derivation can still be applied, although the algebra is
considerably messier.

The  Welch-Aspin  approach  makes  use  of  differential  operators  in  order  to  develop  an 
asymptotic  expansion  for  h  (for  large  I  and  J).  The  same  approach  can  be  used  here, 
but  for  first  order  calculations  the  algebraic  simplifications  which  result  are  insufficient 
to  justify  the  additional  formalism  which  the  operator  technique  requires.  Hence,  the 
discussion  below  consists  of  a  straightforward  Taylor  series  derivation.  Of  course,  both 
methods  must  give  the  same  answer,  and  this  has  been  used  to  provide  a  check  on  the 
calculations. 

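We begin by rewriting (6.35) as

\[ E[\Phi(z_\gamma + U)] = \gamma, \tag{6.37} \]

where

\[ z_\gamma + U = \frac{h(S_1^2, S_2^2)}{\sigma_1/\sqrt{n}} - \delta. \tag{6.38} \]

We then expand $h$ in a series of inverse half-powers of $n_i$:

\[ h = h_0 + h_1 + o(I^{-1/2} + J^{-1/2}). \tag{6.39} \]

Up to terms of second order in $h$ we have that

\[ z_\gamma + U \approx \frac{h_1(S_1^2, S_2^2)}{\sigma_1/\sqrt{n}} + \frac{h_0(S_1^2, S_2^2)}{\sigma_1/\sqrt{n}} - \frac{z_\beta\sigma}{\sigma_1/\sqrt{n}}, \tag{6.40} \]

where we have substituted (6.36) for $\delta$.

For the zeroth order approximation we approximate $h(S_1^2, S_2^2)$ by

\[ h_0(S_1^2, S_2^2) \approx h_0(\sigma_1^2, \sigma_2^2) \]

and we have $U = 0$ and

\[ \frac{h_0(\sigma_1^2, \sigma_2^2) - z_\beta\sigma}{\sigma_1/\sqrt{n}} - z_\gamma = 0 \tag{6.41} \]

or

\[ h_0(S_1^2, S_2^2) = \frac{z_\gamma S_1}{\sqrt{n}} + z_\beta S. \tag{6.42} \]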

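For the first order expression, we approximate $h$ by

\[ h_0(S_1^2, S_2^2) + h_1(S_1^2, S_2^2) \approx h_0(S_1^2, S_2^2) + h_1(\sigma_1^2, \sigma_2^2); \tag{6.43} \]

then

\[ U = z_\gamma U_1 + z_\beta U_2 + \frac{h_1(\sigma_1^2, \sigma_2^2)}{\sigma_1/\sqrt{n}}, \tag{6.44} \]

where

\[ U_1 = \frac{S_1}{\sigma_1} - 1 \tag{6.45} \]

and

\[ U_2 = \frac{\sigma}{\sigma_1/\sqrt{n}} \left( \frac{S}{\sigma} - 1 \right). \tag{6.46} \]

Let $Y_i$ denote a $\chi^2$ random variable with $n_i$ degrees of freedom and define

\[ V_{n_i} = \frac{Y_i}{n_i} - 1 \tag{6.47} \]

for $i = 1,2$. The $U_i$ can be expressed in terms of the $V_{n_i}$ as follows:

\[ U_1 = (1 + V_{n_1})^{1/2} - 1, \tag{6.48} \]

\[ U_2 = \frac{\sigma}{\sigma_1/\sqrt{n}} \left[ (1 + a_1 V_{n_1} + a_2 V_{n_2})^{1/2} - 1 \right]. \tag{6.49} \]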


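After expanding the square roots in (6.48) and (6.49) in power series, one can readily
obtain approximations to the first two moments of the $U_i$ suitable for first order calculations:

\[ E(U_1) \approx -\frac{1}{4 n_1}, \tag{6.50} \]

\[ E(U_1^2) \approx \frac{1}{2 n_1}, \tag{6.51} \]

\[ E(U_2) \approx -\frac{\sigma}{\sigma_1/\sqrt{n}} \cdot \frac{1}{4} \left[ \frac{a_1^2}{n_1} + \frac{a_2^2}{n_2} \right], \tag{6.52} \]

\[ E(U_2^2) \approx \frac{\sigma^2}{\sigma_1^2/n} \cdot \frac{1}{2} \left[ \frac{a_1^2}{n_1} + \frac{a_2^2}{n_2} \right], \tag{6.53} \]

and

\[ E(U_1 U_2) \approx \frac{\sigma}{\sigma_1/\sqrt{n}} \cdot \frac{a_1}{2 n_1}, \tag{6.54} \]

where

\[ a_1 = \frac{\sigma_1^2}{J\sigma^2} \tag{6.55} \]

and

\[ a_2 = \frac{\sigma_2^2 (1 - 1/J)}{\sigma^2}. \tag{6.56} \]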
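
A quick numerical check of the first order moments of $U_1$ may be useful here. The sketch assumes $U_1 = \sqrt{Y/n_1} - 1$ with $Y \sim \chi^2_{n_1}$, for which $E\sqrt{Y/n_1} = \sqrt{2/n_1}\,\Gamma((n_1+1)/2)/\Gamma(n_1/2)$ exactly, and compares the exact moments with the approximations $-1/(4n_1)$ and $1/(2n_1)$. The function name is hypothetical.

```python
import math

def exact_moments_u1(n1):
    """Exact E(U1) and E(U1^2) for U1 = sqrt(Y/n1) - 1, Y ~ chi-square(n1)."""
    # E sqrt(Y/n1), computed through log-gamma to avoid overflow for large n1.
    mean_root = math.sqrt(2.0 / n1) * math.exp(
        math.lgamma((n1 + 1) / 2) - math.lgamma(n1 / 2))
    e_u1 = mean_root - 1.0
    # E(U1^2) = E(Y/n1) - 2 E sqrt(Y/n1) + 1, and E(Y/n1) = 1.
    e_u1_sq = 2.0 - 2.0 * mean_root
    return e_u1, e_u1_sq

for n1 in (4, 20, 100):
    e1, e2 = exact_moments_u1(n1)
    print(n1, e1, -1 / (4 * n1), e2, 1 / (2 * n1))
```

The discrepancy between each exact moment and its approximation shrinks like $1/n_1^2$, which is the sense in which the approximations are adequate for first order calculations.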


The next step is to expand the normal cdf about $z_\gamma$ so that (6.37) can be replaced by
the following approximation:

\[ E[\Phi(z_\gamma + U)] = \gamma \approx \Phi(z_\gamma) + \phi(z_\gamma) E(U) - z_\gamma \phi(z_\gamma) E(U^2)/2, \tag{6.57} \]

where $\phi(\cdot)$ denotes the standard normal density. The expectation of $U$ can be determined
immediately from (6.44), (6.50) and (6.52). Since

\[ E(U^2) \approx z_\gamma^2 E(U_1^2) + z_\beta^2 E(U_2^2) + 2 z_\beta z_\gamma E(U_1 U_2), \tag{6.58} \]

we need only substitute (6.51), (6.53) and (6.54) into (6.58) in order to complete the
evaluation of (6.57).

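To complete these calculations, solve (6.57) for $h_1(\sigma_1^2, \sigma_2^2)$ (note that $h_1$ appears
through $E(U)$), replace each occurrence of $\sigma_i^2$ or $\sigma^2$ with $S_i^2$ or $S^2$ respectively ($i = 1,2$)
and divide $h(S_1^2, S_2^2)$ by $S$ to finally obtain the tolerance limit factor $c$. The terms of
$c = (h_0 + h_1)/S$ can then be rearranged to reveal their structure. The following expression
for the Asymptotic Expansion tolerance limit factor $c$ is one possibility:

\[ c = z_\beta + \frac{z_\gamma W}{\sqrt{I}} + \frac{W}{4\sqrt{I}} \Bigg[
       \frac{z_\gamma (z_\gamma^2 + 1)}{n_1}
     + \frac{2 z_\gamma^2 z_\beta \sqrt{I}\, W}{n_1}
     + \frac{z_\gamma z_\beta^2 I W^2}{n_1}
     + \frac{z_\beta \sqrt{I}\, W^3}{n_1}
     + \frac{z_\gamma z_\beta^2 I (J-1)^2 W^2}{n_2 Q^2}
     + \frac{z_\beta (J-1)^2 \sqrt{I}\, W^3}{n_2 Q^2} \Bigg], \tag{6.59} \]

where

\[ W = (1 + (J-1)/Q)^{-1/2} \tag{6.60} \]

and

\[ Q = S_1^2/S_2^2. \tag{6.61} \]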
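
The expansion can be evaluated directly once a reading of (6.59) is fixed. The sketch below assumes the form $c = z_\beta + z_\gamma W/\sqrt{I} + (W/(4\sqrt{I}))[\,\cdot\,]$ with the six bracketed terms written out in the code, $W = (1 + (J-1)/Q)^{-1/2}$ and $Q = S_1^2/S_2^2$. This is a reconstruction from a first order Taylor argument, so the exact arrangement of terms should be treated as an assumption; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def asymptotic_expansion_factor(I, J, Q, beta, gamma):
    """First order asymptotic expansion tolerance limit factor (assumed form)."""
    z_b = NormalDist().inv_cdf(beta)    # coverage quantile z_beta
    z_g = NormalDist().inv_cdf(gamma)   # confidence quantile z_gamma
    n1, n2 = I - 1, I * (J - 1)
    W = (1.0 + (J - 1) / Q) ** -0.5
    root_I = math.sqrt(I)
    bracket = (z_g * (z_g ** 2 + 1) / n1
               + 2 * z_g ** 2 * z_b * root_I * W / n1
               + z_g * z_b ** 2 * I * W ** 2 / n1
               + z_b * root_I * W ** 3 / n1
               # The last two terms carry 1/Q^2 and blow up as Q -> 0.
               + z_g * z_b ** 2 * I * (J - 1) ** 2 * W ** 2 / (n2 * Q ** 2)
               + z_b * (J - 1) ** 2 * root_I * W ** 3 / (n2 * Q ** 2))
    return z_b + z_g * W / root_I + W / (4 * root_I) * bracket
```

With I = J = 5 and a very large Q (so W is essentially 1), this evaluates to roughly 3.08 for a (.90, .95) limit, below the simple random sample factor of about 3.41 for five observations, which is in line with the anticonservatism noted for few batches.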


The confidence for the above approximation as a function of the population variance
ratio is plotted in Figure 6.1 for a (.90, .95) tolerance limit and $J = 5$. Note that for many
batches this solution performs well, though for few batches it is anticonservative.

Table 6.1: Range in Actual Confidence for Approximate Tolerance Limit c*

                              (β, γ)
    I     J     (.90, .95)      (.99, .95)      (.99, .99)
                 min   max       min   max       min   max
    3     2     .929  .962      .931  .962      .970  .993
    3     5     .921  .962      .914  .962      .954  .992
    3    10     .927  .962      .922  .962      .956  .992
    5     2     .942  .960      .940  .960      .981  .993
    5     5     .944  .962      .945  .962      .980  .993
    5    10     .950  .962      .950  .963      .982  .993
   10     2     .950  .958      .950  .958      .989  .992
   10     5     .950  .960      .950  .960      .990  .993
   10    10     .950  .964      .950  .971      .990  .994

6.5.1  A Simple, Accurate Tolerance Limit Factor Based on an Asymptotic
Expansion

The following two steps lead to an improved tolerance limit factor based on equation
(6.59). First, omit the terms in (6.59) which are proportional to $1/Q^2$, since these are
singular at $Q = 0$ and are very small for moderate to large $Q$. What remains is a
polynomial in $W$. The random variable $z_\beta\sqrt{I}/W$ estimates the noncentrality parameter
$\delta$ defined in (6.36), so a polynomial in $W$ is a polynomial in powers of estimates of the
reciprocal of $\delta$.

There are many ways to choose the coefficients and terms of a polynomial in $W$ so
as to provide approximate tolerance limit factors with good properties. The following
approximation performs remarkably well, considering its extreme simplicity:

\[ c^* = \begin{cases}
   [u_{IJ} - u_I/\sqrt{J} + (u_I - u_{IJ}) W]/(1 - 1/\sqrt{J}) & \text{for } Q > 1 \\
   u_{IJ} & \text{for } Q \le 1
   \end{cases} \tag{6.62} \]

where $u_l$ denotes the corresponding tolerance limit factor for a simple random sample of
size $l$. We will refer to $c^*$ as the Modified Asymptotic Expansion tolerance limit factor.
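
The Modified Asymptotic Expansion factor is simple enough to state as a few lines of code. The sketch below assumes the reading of (6.62) in which the factor interpolates linearly in $W$ between $u_{IJ}$ at $Q = 1$ and $u_I$ as $Q \to \infty$; the function name is hypothetical, and the simple random sample factors must be supplied by the caller from a table or a noncentral t routine.

```python
import math

def modified_factor(u_I, u_IJ, Q, J):
    """Modified Asymptotic Expansion factor c* (assumed reading of (6.62)).

    u_I  : simple random sample factor for sample size I (number of batches)
    u_IJ : simple random sample factor for sample size IJ (pooled data)
    Q    : mean square ratio MS_b / MS_e
    J    : batch size
    """
    if Q <= 1.0:
        # Truncation: Q estimates Jr + 1, which cannot be below one.
        return u_IJ
    W = (1.0 + (J - 1) / Q) ** -0.5
    root_J = math.sqrt(J)
    return (u_IJ - u_I / root_J + (u_I - u_IJ) * W) / (1.0 - 1.0 / root_J)
```

At $Q = 1$ the value $W = 1/\sqrt{J}$ makes the first branch equal $u_{IJ}$, so the two branches join continuously, and as $Q$ grows the factor rises toward $u_I$.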

As $Q \to \infty$, $W \to 1$ and $c^* \to u_I$. If $Q = 1$, then the variance estimate (6.12)
is equal to the pooled sample variance. Since $c^* = u_{IJ}$ when $Q = 1$, the approximate
tolerance limit for the random effects model, using $c^*$ with $Q = 1$, will exactly equal the
corresponding simple random sample tolerance limit for the pooled data. If $Q < 1$,
then we take $c^*$ to equal $u_{IJ}$ so that the random effects tolerance limit factor will never
be less than the tolerance limit factor corresponding to a simple random sample of size
$IJ$. Truncating $Q$ in this way is reasonable since $Q$ estimates $Jr + 1$, which cannot be
less than one.

For any sample size, therefore, $c^*$ will provide a tolerance limit which is exact in the
limit of large $r$ and conservative (because of the requirement that $c^*$ not fall below $u_{IJ}$) for
$r$ near zero. For intermediate $r$, this tolerance limit can be anticonservative, although the
anticonservatism is not prohibitive, except possibly for very few batches. In Table 6.1,
the range in the actual confidence of the tolerance limit factor (6.62) is given for selected
values of $\beta$, $\gamma$, $I$ and $J$.

6.6  The Conditional Expectation Tolerance Limit Factor


For small samples, the first order approximation developed above may not be adequate,
and higher order calculations are clearly prohibitive. An alternative approach to be discussed
next is to formulate the problem as an integral equation, and iteratively improve
on the first order approximation numerically.

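It is convenient to transform from the parameter $r$ to

\[ \tau = Jr + 1 = \sigma_1^2/\sigma_2^2. \tag{6.63} \]

Then

\[ P(\hat\mu - k\hat\sigma_x \le \mu - z_\beta\sigma_x)
   = E\left[ T_{n_1+n_2}\left( k (n_1 + n_2)^{1/2} \sqrt{\frac{IY}{I-1} + \frac{1-Y}{\tau}},\ \delta(\tau) \right) \right], \tag{6.64} \]

where

\[ \delta(\tau) = \sqrt{n}\, z_\beta b = z_\beta \sqrt{I \left( 1 + \frac{J-1}{\tau} \right)}, \tag{6.65} \]

$Y$ is a beta random variable with parameters $n_1/2$ and $n_2/2$, and $b$ is defined in (6.15).
The parameter $\tau$ can be estimated by the sample variance ratio (6.32):

\[ Q = \frac{S_1^2}{S_2^2} = \tau F_{n_1,n_2} = \frac{\tau n_2 Y}{n_1 (1 - Y)}, \tag{6.66} \]

where we use $F_{n_1,n_2}$ to denote a random variable having an F distribution with $n_1$ and
$n_2$ degrees of freedom.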

If we seek a tolerance limit factor of the form

\[ c(S_1^2, S_2^2) = v(Q), \tag{6.67} \]

the remark at the end of Section 6.4 indicates that we seek a solution $v(Q)$ of the integral
equation

\[ V_\tau(v) = E\left[ T_{n_1+n_2}\left( v(Q) (n_1 + n_2)^{1/2} \sqrt{\frac{IY}{I-1} + \frac{1-Y}{\tau}},\ \delta(\tau) \right) \right] = \gamma, \tag{6.68} \]

where the expectation is with respect to the beta density of $Y$.

In Section 6.5, we derived two approximations to $v(Q)$, either of which we label here
$v^0(Q)$. We will improve on this approximation by using the Conditional Expectation
algorithm. Let the approximation at the $n$th iteration be denoted $v^n(Q)$ and define the
iteration

\[ v^{n+1} = v^n + \hat\psi^n, \tag{6.69} \]

where the quasi-Newton step $\hat\psi^n$ is an approximation to the solution $\psi^n$ of the Newton-step
equation

\[ \gamma - V_\tau(v^n) = E\left[ (n_1 + n_2)^{1/2} \sqrt{\frac{IY}{I-1} + \frac{1-Y}{\tau}}\;
   T'_{n_1+n_2}\left( v^n(Q) (n_1 + n_2)^{1/2} \sqrt{\frac{IY}{I-1} + \frac{1-Y}{\tau}},\ \delta(\tau) \right) \psi^n(Q) \right], \tag{6.70} \]

where $T'_{n_1+n_2}(\cdot, \cdot)$ denotes the noncentral t density and $V_\tau(\cdot)$ is given in (6.68).

The  noncentral  t  density  with  /  degrees  of  freedom  and  noncentrality  parameter  6  can 
be  calculated  by  means  of  the  following  formula  (Odeh  and  Owen,  1980,  p.  272): 


(6.71) 


Since  there  are  computer  subroutines  available  for  determining  the  noncentral  t  cdf  (see, 
e.g.,  Griffiths  and  Hill,  1985),  (6.71)  is  very  useful  for  computation. 

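Using the shorthand notation of Chapter 4, we write (6.68) as

\[ \int_0^1 k\{\tau, y, v^n[\phi(\tau, y)]\}\, dy = g(\tau) = \gamma, \tag{6.72} \]

where

\[ \phi(\tau, y) = \frac{\tau n_2 y}{n_1 (1 - y)}. \tag{6.73} \]

We also rewrite the Newton-step equation (6.70) as

\[ \gamma - \int_0^1 k\{\tau, y, v^n[\phi(\tau, y)]\}\, dy = \int_0^1 k'\{\tau, y, v^n[\phi(\tau, y)]\}\, \psi^n[\phi(\tau, y)]\, dy. \tag{6.74} \]

The kernel of the integral equation (6.72) and the derivative of this kernel with respect
to its third argument are given by

\[ k\{\tau, y, v^n[\phi(\tau, y)]\} = \mathrm{Beta}(y; n_1/2, n_2/2)\;
   T_{n_1+n_2}\left( v^n[\phi(\tau, y)] (n_1 + n_2)^{1/2} \sqrt{\frac{Iy}{I-1} + \frac{1-y}{\tau}},\ \delta(\tau) \right) \tag{6.75} \]

and

\[ k'\{\tau, y, v^n[\phi(\tau, y)]\} = \mathrm{Beta}(y; n_1/2, n_2/2)\, (n_1 + n_2)^{1/2} \sqrt{\frac{Iy}{I-1} + \frac{1-y}{\tau}}\;
   T'_{n_1+n_2}\left( v^n[\phi(\tau, y)] (n_1 + n_2)^{1/2} \sqrt{\frac{Iy}{I-1} + \frac{1-y}{\tau}},\ \delta(\tau) \right) \tag{6.76} \]

respectively, where

\[ \mathrm{Beta}(y; n_1/2, n_2/2) = \frac{\Gamma((n_1 + n_2)/2)}{\Gamma(n_1/2)\,\Gamma(n_2/2)}\, y^{n_1/2 - 1} (1 - y)^{n_2/2 - 1}. \tag{6.77} \]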

For any fixed $\tau$, we can numerically determine the location $y_*(\tau)$ of the peak of the
kernel, and define

\[ q_*(\tau) = \frac{\tau n_2 y_*(\tau)}{n_1 (1 - y_*(\tau))}. \tag{6.78} \]

We propose doing this for many values of $\tau$. Inspection of the kernel shows that the peaks
fall on a nearly straight ridge. Qualitatively, the contours of this kernel look much like
Figure 5.1 of Chapter 5. We make the approximation

\[ \psi^n[\phi(\tau, y)] \approx \psi^n[\phi(\tau, y_*)] = \psi^n[q_*(\tau)], \tag{6.79} \]

and we note that $\psi^n[q_*(\tau)]$ is not a function of $y$ and so can be removed from the integrand.
We have the following approximate Newton step

\[ \hat\psi^n[q_*(\tau)] = \frac{\gamma - \int_0^1 k\{\tau, y, v^n[\phi(\tau, y)]\}\, dy}{\int_0^1 k'\{\tau, y, v^n[\phi(\tau, y)]\}\, dy}, \tag{6.80} \]

which is in the form of a Conditional Expectation step with a single step for the inner
iteration, and with the initial iterate for each inner iteration identically zero; that is, we
have applied the Conditional Expectation algorithm in the form (4.16).

It is fortunate that in our tolerance limit problem, the function $y_*(\tau)$ is nearly independent
of $\tau$. Thus, $\psi^n[\phi(\tau, y)]$ can be evaluated at or very nearly at a specified grid of
$q_*$ values by adjusting $\tau$ after the nearly constant value $y_*$ of $y_*(\tau)$ is approximated for a
typical $\tau$ value.

One difficulty with the above proposal arises from the fact that, strictly speaking, $\tau$
should only be taken to be greater than one, in which case the range of $q_*$ values is from
$n_2 y_*/[n_1(1 - y_*)]$ to $\infty$ instead of from 0 to $\infty$ as is required for the numerical integration.
Since $n_2 y_*/[n_1(1 - y_*)]$ turns out to be relatively small we translate the value of $q_*$ by this
amount, so that the range of $q$ values will be 0 to $\infty$. In other words, we replace $v^n(q_*)$
in the approximation

\[ \gamma \approx \int_0^1 k\{\tau, y, v^n(q_*)\}\, dy + \int_0^1 k'\{\tau, y, v^n(q_*)\}\, \psi^n[\phi(\tau, y)]\, dy \tag{6.81} \]

by $v^n\{q_* - n_2 y_*/[n_1(1 - y_*)]\}$. After this approximation is carried out, the method can
be iterated using

\[ v^{n+1}\left\{ q_* - \frac{n_2 y_*}{n_1 (1 - y_*)} \right\}
   = v^n\left\{ q_* - \frac{n_2 y_*}{n_1 (1 - y_*)} \right\} + \hat\psi^n(q_*) \tag{6.82} \]

to replace $v^n$.

With  each  iteration  the  value  of  the  constant  y.  is  likely  to  change  and  should  be 
recalculated. 

The above simple improvement of the approximation underlying the Conditional Expectation
approach enables one to calculate tolerance limit factors which provide very
nearly the nominal confidence even for few batches and small batch size. Tolerance limit
factors determined by means of the Conditional Expectation algorithm will be referred to
as Conditional Expectation tolerance limits.

The simple Conditional Expectation iteration outlined in this section is easily implemented,
and works astonishingly well for this difficult (unsolvable?) nonlinear problem.
Ten or twenty iterations will usually provide a smooth tolerance limit factor which provides
almost exactly the nominal size for all values of the nuisance parameter.

In fact, the calculations are simple enough to be performed interactively, and functions
written in S for doing this are provided in Appendix D.


6.6.1  Polynomial  Approximations  to  the  Integral  Equation  Solutions 

The Conditional Expectation tolerance limit factors are, for many situations, well approximated
by polynomials in $W$, where $W$ is defined in (6.60). For the combinations of $\beta$, $\gamma$,
$I$, and $J$ most important for aircraft design allowable applications, the cubic polynomial

\[ v = a + bW + cW^2 + dW^3 \tag{6.83} \]

was fit, by least squares, to the approximate numerical solutions to (6.68). Since the
numerical method of this section is not useful for the case of $I = 2$, we only consider
$I > 2$. The approximate tolerance limit factor $v$, obtained using the coefficients in Tables
6.2 and 6.3, provides very nearly the nominal confidence for all values of $r$.
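
As a usage illustration, the cubic can be evaluated from a table row and its confidence checked by simulation. The sketch below uses the I = 5, J = 5 row of Table 6.2 and assumes $W = (1 + (J-1)/Q)^{-1/2}$ with $Q = MS_b/MS_e$, a limit of the form $\hat\mu - v(Q)S$ with $S^2 = MS_b/J + (1 - 1/J)MS_e$, and independent sampling of the pooled mean and the two mean squares from their normal and scaled chi-square distributions. The function names, the seed and the number of replications are arbitrary choices for this example.

```python
import math
import random
from statistics import NormalDist

# Row I = 5, J = 5 of Table 6.2 ((.90, .95) lower tolerance limits).
COEF_5_5 = (1.598, -0.638, 3.984, -1.537)

def poly_factor(Q, J, coef):
    """Cubic v = a + bW + cW^2 + dW^3 with W = (1 + (J - 1)/Q)^(-1/2)."""
    a, b, c, d = coef
    W = (1.0 + (J - 1) / Q) ** -0.5
    return a + W * (b + W * (c + W * d))

def simulated_confidence(I, J, r, beta, coef, reps=20000, seed=1):
    """Monte Carlo estimate of P(mu_hat - v(Q) S <= mu - z_beta sigma_x)."""
    rng = random.Random(seed)
    n1, n2, n = I - 1, I * (J - 1), I * J
    sigma_e2, sigma_b2 = 1.0, r            # only the ratio r matters
    sigma_1_sq = J * sigma_b2 + sigma_e2   # expected between batch mean square
    sigma_x = math.sqrt(sigma_b2 + sigma_e2)
    target = -NormalDist().inv_cdf(beta) * sigma_x   # population mean set to 0
    hits = 0
    for _ in range(reps):
        mu_hat = rng.gauss(0.0, math.sqrt(sigma_1_sq / n))
        # chi-square(k) is Gamma(shape k/2, scale 2).
        ms_b = sigma_1_sq * rng.gammavariate(n1 / 2, 2.0) / n1
        ms_e = sigma_e2 * rng.gammavariate(n2 / 2, 2.0) / n2
        s = math.sqrt(ms_b / J + (1 - 1 / J) * ms_e)
        limit = mu_hat - poly_factor(ms_b / ms_e, J, coef) * s
        hits += limit <= target
    return hits / reps
```

At $W = 1$ the four coefficients of this row sum to 3.407, the simple random sample factor for five observations, which is a useful check that a row has been transcribed correctly. Simulated confidence for moderate r should land close to the nominal .95, up to Monte Carlo error.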


6.7  The  Distributions  of  the  Tolerance  Limits 

Once the function $v$ of Section 6.6 has been determined it is straightforward to calculate
the cumulative distribution function of the corresponding tolerance limit. It is obviously
preferable to compare distributions of confidence bounds rather than merely confidence
levels, and we make such a comparison in this section.

Using the notation of Section 6.6, the tolerance limit cdf is a function $H(t)$ given by

\[ H(t; \beta, \tau) = P(\bar{X} - v(Q) S \le \mu - t\sigma). \tag{6.84} \]

For given $v(Q)$, we would like $H(t; \beta, \tau)$ to be less than $\gamma$ for $t < \mu - z_\beta\sigma$, greater than
$\gamma$ for $t > \mu - z_\beta\sigma$, and equal to $\gamma$ for $t = \mu - z_\beta\sigma$. This cdf does not depend on $\mu$,
and it depends on $\sigma_b^2$ and $\sigma_e^2$ only through $r$. For our procedure $\beta$ is fixed, so we let
$H(t; \tau) = H(t; \tau, \beta)$ and see how well we do compared to the ideal case of known $r$. Since
this is just the function $V_\tau(v^n)$ of (6.68) with $v^n$ replaced by $v$ and $z_\beta$ replaced by $t$, we
are able to examine the entire distribution of the tolerance limit with little more effort
than is required to calculate the tolerance limit factor.

In Figure 6.2, the cumulative distributions for (.90, .95) Conditional Expectation lower
tolerance limits with $I = J = 5$ are presented for various values of the intraclass correlation
$\rho = r/(r + 1)$.

Note that all of the curves pass very nearly through $(x_\beta, .95)$, where $x_\beta = \mu - z_\beta\sigma_x$,
indicating the striking success that we have had at removing the nuisance parameter, even
for as few as five batches. As the intraclass correlation is increased the random effects
sample goes from behaving essentially like a single sample of size $n = IJ$ when $\rho = 0$ to
being equivalent to a single batch of size $I$ when $\rho = 1$.

In Figure 6.3 three cdfs are plotted, corresponding to the Mee-Owen method, the
Conditional Expectation tolerance limit and the solution for known $r = \rho = 0$. The
intraclass correlation is taken to equal zero and the sample size is again $I = J = 5$. Note
that the Conditional Expectation tolerance limit is clearly preferable to the Mee-Owen
solution and doesn’t fare too badly when compared to the known-r solution.


6.8  Discussion 

The  situation  of  primary  interest  to  the  aircraft  industry,  (.90,  .95)  lower  tolerance  limits, 
is  used  here  for  illustration.  The  methods  presented  in  this  chapter  include  a  Modified 
Asymptotic  Expansion  tolerance  limit  based  on  the  Welch-Aspin  expansion  (6.62),  and 
the  Conditional  Expectation  tolerance  limit  based  on  the  numerical  solution  of  an  integral 
equation  (Section  6.6).  The  confidence  for  these  two  methods  and  for  the  Mee-Owen 
method  as  a  function  of  the  intraclass  correlation  is  presented  in  Figure  6.4  for  five  batches 
each  of  size  five. 

The  various  proposed  tolerance  limit  factors,  along  with  the  factor  of  Mee  and  Owen 
(1983),  are  displayed  in  Figure  6.5.  The  Mee-Owen  tolerance  limit  factor  is  discontinuous 
Table 6.2: Coefficients of v for (.90, .95) Lower Tolerance Limits


  Sample Size                 Coefficients
    I     J         a         b         c         d
    3     2     1.783     8.360   -10.762     6.773
    3     3     1.355     2.839     2.725    -0.763
    3     4     1.369     1.499     5.960    -2.672
    3     5     1.403     1.051     6.880    -3.179
    3     6     1.444     0.843     7.118    -3.250
    3     7     1.450     0.925     6.843    -3.063
    3     8     1.442     0.995     6.714    -2.995
    3     9     1.443     0.981     6.748    -3.016
    3    10     1.426     1.195     6.275    -2.741
    3     ∞     1.255     1.960     5.233    -2.293
    4     2     1.820    -1.036     5.548    -2.170
    4     3     1.604    -0.389     4.887    -1.940
    4     4     1.559    -0.286     4.848    -1.960
    4     5     1.550    -0.307     4.946    -2.028
    4     6     1.542    -0.305     4.986    -2.061
    4     7     1.531    -0.275     4.964    -2.059
    4     8     1.520    -0.241     4.934    -2.051
    4     9     1.508    -0.190     4.868    -2.024
    4    10     1.484    -0.077     4.702    -1.947
    4     ∞     1.281     0.940     3.148    -1.208
    5     2     1.860    -1.878     5.814    -2.389
    5     3     1.710    -1.042     4.462    -1.723
    5     4     1.635    -0.743     4.074    -1.559
    5     5     1.598    -0.638     3.984    -1.537
    5     6     1.574    -0.575     3.939    -1.531
    5     7     1.555    -0.516     3.884    -1.516
    5     8     1.539    -0.464     3.833    -1.501
    5     9     1.525    -0.419     3.786    -1.485
    5    10     1.502    -0.317     3.645    -1.423
    5     ∞     1.286     0.707     2.125    -0.712
    6     2     1.861    -2.064     5.431    -2.222
    6     3     1.721    -1.024     3.607    -1.298
    6     4     1.644    -0.747     3.271    -1.162
    6     5     1.604    -0.654     3.219    -1.163
    6     6     1.577    -0.591     3.186    -1.165
    6     7     1.555    -0.533     3.145    -1.160
    6     8     1.537    -0.486     3.107    -1.152
    6     9     1.523    -0.447     3.077    -1.147
    6    10     1.507    -0.387     3.005    -1.119
    6     ∞     1.287     0.613     1.564    -0.457
    7     2     1.845    -1.974     4.808    -1.924
    7     3     1.711    -0.911     2.926    -0.970
    7     4     1.637    -0.682     2.682    -0.881
    7     5     1.598    -0.609     2.671    -0.905
    7     6     1.569    -0.553     2.660    -0.920
    7     7     1.547    -0.504     2.637    -0.924
    7     8     1.530    -0.465     2.622    -0.931
    7     9     1.515    -0.426     2.589    -0.923
    7    10     1.498    -0.358     2.529    -0.914
    7     ∞     1.287     0.558     1.222    -0.311
    8     2     1.749    -1.136     2.979    -1.010
    8     3     1.660    -0.668     2.260    -0.670
    8     4     1.607    -0.554     2.197    -0.668
    8     5     1.578    -0.530     2.254    -0.721
    8     6     1.555    -0.501     2.281    -0.753
    8     7     1.536    -0.466     2.279    -0.767
    8     8     1.520    -0.431     2.263    -0.771
    8     9     1.506    -0.394     2.236    -0.766
    8    10     1.485    -0.315     2.138    -0.727
    8     ∞     1.286     0.520     0.996    -0.220
    9     2     1.740    -1.068     2.640    -0.859
    9     3     1.651    -0.611     1.943    -0.529
    9     4     1.599    -0.511     1.905    -0.539
    9     5     1.569    -0.490     1.970    -0.596
    9     6     1.546    -0.463     2.001    -0.631
    9     7     1.527    -0.429     2.003    -0.647
    9     8     1.510    -0.395     1.990    -0.652
    9     9     1.494    -0.347     1.949    -0.642
    9    10     1.480    -0.308     1.912    -0.631
    9     ∞     1.286     0.490     0.837    -0.159
   10     2     1.730    -0.992     2.343    -0.727
   10     3     1.640    -0.556     1.689    -0.418
   10     4     1.590    -0.471     1.676    -0.440
   10     5     1.560    -0.452     1.748    -0.501
   10     6     1.536    -0.426     1.782    -0.537
   10     7     1.517    -0.395     1.789    -0.556
   10     8     1.501    -0.363     1.779    -0.563
   10     9     1.486    -0.322     1.749    -0.557
   10    10     1.475    -0.301     1.740    -0.560
   10     ∞     1.285     0.466     0.720    -0.117
Table 6.3: Coefficients of v for (.99, .95) Lower Tolerance Limits


Sample  Size 

Coefficients 

I 

J 

a 

b 

c 

d 

3 

2 

3.105 

4.815 

2.357 

0.276 

3 

3 

2.554 

2.311 

9.725 

-4.038 

3 

4 

2.543 

2.021 

10.472 

-4.484 

3 

5 

2.552 

1.857 

10.843 

-4.699 

3 

6 

2.558 

1.743 

11.104 

-4.852 

3 

7 

2.555 

1.719 

11.184 

-4.904 

3 

8 

2.550 

1.717 

11.214 

-4.928 

3 

9 

2.501 

1.948 

10.883 

-4.778 

3 

10 

2.468 

2.480 

9.619 

-4.014 

3 

00 

2.269 

3.024 

9.363 

-4.104 

4 

2 

2.933 

-0.544 

7.263 

-2.610 

4 

3 

2.608 

0.125 

6.989 

-2.680 

4 

4 

2.613 

-0.082 

7.464 

-2.952 

4 

5 

2.648 

-0.362 

7.984 

-3.228 

4 

6 

2.671 

-0.543 

8.324 

-3.410 

4 

7 

2.681 

-0.646 

8.529 

-3.523 

4 

8 

2.622 

-0.391 

8.201 

-3.390 

4 

9 

2.622 

-0.074 

7.463 

-2.970 

4 

10 

2.554 

-0.062 

7.712 

-3.162 

4 

oo 

2.310 

1.071 

6.064 

-2.403 

5 

2 

2.919 

-1.441 

6.702 

-2.439 

5 

3 

2.608 

-0.129 

4.949 

-1.687 

5 

4 

2.615 

-0.310 

5.339 

-1.902 

5 

5 

2.655 

-0.617 

5.890 

-2.187 

5 

6 

2.682 

-0.830 

6.283 

-2.393 

5 

7 

2.696 

-0.961 

6.535 

-2.529 

5 

8 

2.673 

-0.895 

6.477 

-2.514 

5 

9 

2.606 

-0.314 

5.540 

-2.090 

5 

10 

2.542 

0.192 

4.248 

-1.241 

5 

00 

2.323 

0.634 

4.351 

-1.567 

6 

2 

2.905 

-1.421 

5.433 

-1.856 

6 

3 

2.587 

0.036 

3.356 

-0.917 

6 

4 

2.601 

-0.198 

3.827 

-1.168 

6 

5 

2.637 

-0.505 

4.387 

-1.457 

6 

6 

2.662 

-0.723 

4.794 

-1.671 

6 

7 

2.676 

-0.868 

5.075 

-1.822 

6 

8 

2.556 

-0.075 

3.805 

-1.224 

6 

9 

2.592 

-0.350 

4.271 

-1.451 

6 

10 

2.616 

-0.695 

4.918 

-1.777 

6 

00 

2.328 

0.484 

3.356 

-1.106 
Table 6.3 (continued): Coefficients of v for (.99, .95) Lower Tolerance Limits

Sample Size              Coefficients
  I     J         a         b         c         d
  7     2     2.833    -0.830     3.638    -1.000
  7     3     2.568     0.265     2.138    -0.330
  7     4     2.588    -0.042     2.743    -0.647
  7     5     2.616    -0.336     3.300    -0.939
  7     6     2.635    -0.542     3.701    -1.152
  7     7     2.647    -0.686     3.989    -1.307
  7     8     2.575    -0.041     2.770    -0.663
  7     9     2.575    -0.421     3.664    -1.177
  7    10     2.562    -0.265     3.317    -0.972
  7     ∞     2.330     0.416     2.720    -0.824
  8     2     2.728    -0.005     1.697    -0.067
  8     3     2.557     0.465     1.219     0.113
  8     4     2.576     0.103     1.941    -0.267
  8     5     2.595    -0.174     2.493    -0.561
  8     6     2.608    -0.365     2.881    -0.770
  8     7     2.617    -0.503     3.165    -0.925
  8     8     2.585    -0.380     3.006    -0.857
  8     9     2.624    -0.671     3.528    -1.127
  8    10     2.506     0.535     0.822     0.492
  8     ∞     2.330     0.378     2.284    -0.638
  9     2     2.642     0.758    -0.035     0.778
  9     3     2.551     0.620     0.527     0.446
  9     4     2.566     0.224     1.338     0.014
  9     5     2.578    -0.038     1.883    -0.280
  9     6     2.585    -0.213     2.256    -0.484
  9     7     2.573    -0.264     2.416    -0.583
  9     8     2.592    -0.435     2.737    -0.751
  9     9     2.558    -0.218     2.390    -0.588
  9    10     2.546    -0.323     2.664    -0.744
  9     ∞     2.330     0.354     1.968    -0.509
 10     2     2.593     1.357    -1.453     1.485
 10     3     2.549     0.735     0.001     0.697
 10     4     2.558     0.320     0.878     0.225
 10     5     2.562     0.075     1.412    -0.068
 10     6     2.563    -0.085     1.769    -0.266
 10     7     2.549    -0.073     1.792    -0.287
 10     8     2.545    -0.168     2.031    -0.427
 10     9     2.544    -0.171     2.090    -0.482
 10    10     2.516    -0.158     2.123    -0.500
 10     ∞     2.330     0.336     1.730    -0.415
because these authors recommend pooling the data if Q < 1. For the most part, the
differences in the tolerance limit factors are not large.

The  integral  equation  approach  virtually  removes  the  nuisance  parameter  from  the 
problem.  The  Mee-Owen  method  has  the  disadvantage  of  being  substantially  conservative 
when  the  variance  ratio  is  small. 

From the rescaled plot of the coverage probability function for the integral equation
solution (Figure 6.6) it can be seen that for r > 1 the actual coverage probability differs
from .95 by no more than ±.001. This small difference can be attributed to the limited
accuracy of the numerical integration. For r < 1, however, the difference between the actual
and nominal coverage probabilities increases substantially, although it never reaches a
magnitude that warrants concern for applications.

Figure 6.7 illustrates the convergence of the Conditional Expectation algorithm for
various values of the intraclass correlation. Note that for practical purposes ten iterations
are adequate, although some slight improvement can result from considering more
iterations.


6.9  Examples 

We  consider  two  examples  in  this  section.  The  first  example  is  a  situation  where  there  is 
considerable  between-batch  variability,  and  the  second  is  a  case  where  the  true  between- 
batch  variance  is  zero,  since  the  ‘batches’  are  artificially  constructed  from  a  simple  random 
sample. 

A manufacturer of aircraft components always performs certain mechanical tests on
specimens from each batch of composite material. The data in Table 6.4 are coded tensile
strength measurements made on five consecutive batches (R. Zabora, personal communication,
1988). The results of an analysis using the Mee-Owen method and the methods
of this chapter are also presented in Table 6.4.

All  of  the  tolerance  limit  methods  give  nearly  the  same  answer.  These  three  methods 
will  always  agree  in  the  limit  of  large  between-batch  variability. 

To see how much these methods differ when the between-batch variability is minimal,
we begin with a simple random sample of 180 composite tensile strength measurements
(Reese and Sorem, 1981). The normal distribution fits these data reasonably well, especially
in the tails, so we proceed to choose 25 specimens at random (with replacement)
from this set and to divide these into five ‘batches’ of size five. These data are given, along
with tolerance limit calculations, in Table 6.5. Note the difference between the Conditional
Expectation solution and the other results. Although this difference is a fraction of
a standard deviation, it might be large enough to be of engineering importance for some
applications.

Since the ‘batches’ in this second example were artificially created, it is interesting to
compare the above random effects tolerance limits with the pooled sample tolerance limit:
203.93 - 1.838(18.38) = 170.15.
Table 6.4: Example #1: Coded Strength Measurements From Five Batches

Batch    Coded Strength Measurements
  1      379   357   390   376   376
  2      363   367   382   381   359
  3      401   402   407   402   396
  4      402   387   392   395   394
  5      415   405   396   390   395

$\bar{X} = 388.36$    $S_b^2 = 1040.84$    $S_w^2 = 78.92$    $\hat{\sigma}^2 = 271.30$

$k_{mo} = 3.072$     $\bar{X} - k_{mo}S = 337.76$
$k_{ce} = 3.063$     $\bar{X} - k_{ce}S = 337.90$
$k_{mae} = 3.055$    $\bar{X} - k_{mae}S = 338.04$

NOTE: $\beta = .9$, $\gamma = .95$. The subscripts mo, ce, and mae denote the Mee-Owen,
Conditional Expectation (6.83), and Modified Asymptotic Expansion
(6.62) tolerance limit factors, respectively.


Table 6.5: Example #2: Artificially Batched Data From a Simple Random Sample

‘Batch’    Tensile Strength in 1000 psi
   1       203.41   209.58   213.35   218.56   242.76
   2       185.97   190.67   207.88   210.80   231.46
   3       184.41   200.73   206.51   209.84   212.15
   4       160.44   180.95   201.95   204.60   219.51
   5       174.63   185.34   205.59   212.00   225.25

$\bar{X} = 203.93$    $S_b^2 = 386.04$    $S_w^2 = 325.56$    $\hat{\sigma}^2 = 337.65$

$k_{mo} = 2.12$     $\bar{X} - k_{mo}S = 164.98$
$k_{ce} = 2.04$     $\bar{X} - k_{ce}S = 166.45$
$k_{mae} = 1.93$    $\bar{X} - k_{mae}S = 168.47$

NOTE: $\beta = .9$, $\gamma = .95$. The subscripts mo, ce, and mae denote the Mee-Owen,
Conditional Expectation (6.83), and Modified Asymptotic Expansion
(6.62) tolerance limit factors, respectively.
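
The summary statistics reported for the artificially batched data of Example 2 can be checked directly from the 25 observations. The Python sketch below rests on an assumption inferred from the tabulated values rather than quoted from this section: the first two statistics are taken to be the between-batch and within-batch mean squares of a balanced one-way layout, and the variance used in the tolerance limits is taken to be the within mean square plus (between minus within)/J. All function and variable names are illustrative.

```python
import math

# Table 6.5 data: five artificial 'batches' of five tensile strengths (1000 psi).
batches = [
    [203.41, 209.58, 213.35, 218.56, 242.76],
    [185.97, 190.67, 207.88, 210.80, 231.46],
    [184.41, 200.73, 206.51, 209.84, 212.15],
    [160.44, 180.95, 201.95, 204.60, 219.51],
    [174.63, 185.34, 205.59, 212.00, 225.25],
]

def one_way_summary(data):
    """Grand mean, between/within mean squares, and the assumed
    single-observation variance estimate for a balanced one-way layout."""
    n_batches, batch_size = len(data), len(data[0])
    means = [sum(b) / batch_size for b in data]
    grand = sum(means) / n_batches
    ms_between = batch_size * sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    ss_within = sum((x - m) ** 2 for b, m in zip(data, means) for x in b)
    ms_within = ss_within / (n_batches * (batch_size - 1))
    # Assumption: total variance = within component + between component.
    total_var = ms_within + (ms_between - ms_within) / batch_size
    return grand, ms_between, ms_within, total_var

def lower_tolerance_limit(mean, variance, k):
    """Lower tolerance limit of the form mean - k * S."""
    return mean - k * math.sqrt(variance)

grand, msb, msw, var = one_way_summary(batches)
limit_mo = lower_tolerance_limit(grand, var, 2.12)   # Mee-Owen factor from the table
print(round(grand, 2), round(msb, 2), round(msw, 2), round(var, 2), round(limit_mo, 2))
```

Only the factor k changes between the three methods, so the other two limits follow by calling `lower_tolerance_limit` with the corresponding tabulated factor.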
Figure 6.1 (horizontal axis: Intraclass Correlation)

Figure 6.2: Distributions of (.90, .95) Conditional Expectation Tolerance Limits.
(horizontal axis: Tolerance Limit)

Figure 6.3: Distributions of Tolerance Limits for
(horizontal axis: Tolerance Limit)

Figure 6.4: Comparison of Confidence as Functions of
the Population Intraclass Correlation.

Figure 6.5: A Comparison of Various Tolerance Limit Factors
(axis: Tolerance limit factor)

Figure 6.6: Coverage Probability for Conditional Expectation Algorithm
Tolerance Limit as a Function of the Intraclass Correlation.
Chapter 7


An Ill-Posed Inverse Problem in Stereology

7.1  Ill-Posed  Inverse  Problems  in  Applied  Science 

This thesis has been concerned thus far with describing and applying the Conditional
Expectation algorithm to the solution of integral equations of the first kind, where the
known functions are given without error. Problems of this sort are examples of ill-posed
inverse problems. The study of ill-posed inverse problems in applications, where the
right hand side is observed with error, or where the right hand side is an estimate of a
probability density, is receiving increasing attention in statistics (O’Sullivan, 1986). There
are many examples of inverse problems, in such diverse areas as geophysics, tomography,
water resource management, and stereology.

Problems involving integral equations of the first kind generally arise when a quantity
is indirectly observed. For a linear problem, we have

\[ \int_0^1 k(x,y)\, f(y)\, dy = g(x), \tag{7.1} \]

where g(x) is a function, observed with error, which acts as a proxy for the unobservable
f(y). The kernel, k(x,y), relates the observable, g, to the quantity of interest, f. The
kernel k, which we will always assume to be known, is often a model for the response of
a measuring instrument.

The Conditional Expectation algorithm is rapidly convergent for a fairly wide class of
problems and produces smooth near-solutions (see Section 1.2.3). It may therefore be
useful for certain inverse problems in applied science as well. We consider next one such
problem, the classical random sphere problem of stereology.


7.2  The  Random  Sphere  Problem 

Often investigators in medicine, materials science, and astronomy, among other fields, are
faced with the following situation. Observations are made on a two-phase material where
the first phase consists of spheres of random radius, and these spheres are randomly
distributed in a second phase. Examples include stars in a globular cluster (Wicksell,
1926), tumor cell nuclei in a mouse liver (Keiding et al., 1972), and air bubbles in
polystyrene (Meisner, 1967). The distribution of the radii of the spheres is desired, but
data are available only on the radii either of circular projections or else of sections of these
spheres: for example, circular cross sections of tumors measured from a thin slice of a
dissected organ.

This problem was apparently first correctly modeled by Wicksell (1925). A large
literature has its origin with this Wicksell article, including a wide variety of solution
techniques. The interested reader can begin with the reviews of Anderssen and Jakeman
(1974), Jakeman and Anderssen (1974), Cruz-Orive (1983), and Coleman (1989). For
purposes of practical stereology, the Wicksell problem has been largely solved, but ‘its very
simple structure makes it a perfect vehicle for testing numerical and statistical procedures’
(Coleman, 1989, p. 244). So this problem is a natural one to consider, and we begin
by introducing some of the theory for a class of integral equations to which the random
sphere equation belongs.


7.3  Singular  Integral  Equations  of  Abel  Type 


Abel’s integral equation, in its simplest form, is

\[ \int_0^x \frac{f(y)}{(x-y)^{1/2}}\, dy = g(x). \tag{7.2} \]

This is a weakly singular Volterra equation. An equation is said to be singular if either
the kernel is singular, the range of integration is unbounded, or both (Porter and Stirling,
1990, Chapter 9). A weak singularity is of the form $(x-y)^{-\alpha}$ for $0 < \alpha < 1$. The Abel
equation appears in the solution of the brachistochrone problem with which the calculus
of variations began (e.g., Weinstock, 1974, pp. 19, 28-29), and so it is of considerable
historical, as well as practical, importance.

To solve the equation (7.2) analytically, we apply the operator

\[ \int_0^x \frac{(\,\cdot\,)\, ds}{(x-s)^{1/2}} \tag{7.3} \]

to both sides, giving

\[ \int_0^x \frac{ds}{(x-s)^{1/2}} \int_0^s \frac{f(t)\,dt}{(s-t)^{1/2}} = \int_0^x \frac{g(s)\,ds}{(x-s)^{1/2}}. \tag{7.4} \]

Interchanging the order of integration on the left hand side of (7.4), we have

\[ \int_0^x f(t)\,dt \int_t^x \frac{ds}{(x-s)^{1/2}(s-t)^{1/2}} = \int_0^x \frac{g(s)\,ds}{(x-s)^{1/2}}. \tag{7.5} \]

The change of variable

\[ s = x\sin^2\theta + t\cos^2\theta \tag{7.6} \]

gives

\[ \int_t^x \frac{ds}{(x-s)^{1/2}(s-t)^{1/2}} = \pi. \tag{7.7} \]

The solution to (7.2) can now be seen to be

\[ f(x) = \frac{1}{\pi}\frac{d}{dx}\int_0^x \frac{g(t)\,dt}{(x-t)^{1/2}}. \tag{7.8} \]
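
As a numerical sanity check of the inversion formula (7.8), the following Python sketch may be helpful. It is an illustration only: the function names, the midpoint rule, the central difference, and the test pair f = 1, g(x) = 2 sqrt(x) (which satisfies (7.2)) are choices made here, not taken from the text. The inner integral is evaluated after the substitution t = x sin^2(theta), which removes the endpoint singularity in the same spirit as (7.6).

```python
import math

def abel_integral(g, x, n=2000):
    """Approximate I(x) = int_0^x g(t) / sqrt(x - t) dt.

    With t = x*sin(theta)**2 the integral becomes
    int_0^{pi/2} 2*sqrt(x)*sin(theta)*g(x*sin(theta)**2) d(theta),
    whose integrand is bounded, so a midpoint rule suffices here."""
    if x <= 0.0:
        return 0.0
    step = (math.pi / 2.0) / n
    total = 0.0
    for i in range(n):
        s = math.sin((i + 0.5) * step)
        total += 2.0 * math.sqrt(x) * s * g(x * s * s)
    return total * step

def abel_invert(g, x, dx=1e-4):
    """f(x) = (1/pi) d/dx I(x), with the derivative taken by central difference."""
    return (abel_integral(g, x + dx) - abel_integral(g, x - dx)) / (2.0 * dx * math.pi)

# For f(y) = 1, equation (7.2) gives g(x) = 2*sqrt(x); the inversion should return about 1.
recovered = abel_invert(lambda t: 2.0 * math.sqrt(t), 0.5)
print(recovered)
```

The difference quotient is the fragile part of this calculation: if g carries noise of size delta, the quotient amplifies it by roughly delta/dx, which is the numerical differentiation difficulty that inversion-formula methods must control.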
Various generalizations of (7.2) are possible, the most important being replacing the
exponent 1/2 in the denominator of the integrand of (7.2) with any $\alpha \in (0,1)$, for which
we have the inversion formula (Porter and Stirling, 1990, p. 293)

\[ f(x) = \frac{\sin(\alpha\pi)}{\pi}\frac{d}{dx}\int_0^x \frac{g(t)\,dt}{(x-t)^{1-\alpha}}. \tag{7.9} \]


To  use  either  of  the  inversion  formulas  (7.8)  or  (7.9)  numerically,  one  must  perform 
numerical  differentiation.  Algorithms  for  solving  Abel  integral  equations  numerically  by 
means  of  the  inversion  formula  use  devices  such  as  spectral  differentiation  and  smoothing 
to  deal  with  the  well-known  difficulties  inherent  in  numerical  differentiation.  Iterative 
algorithms,  on  the  other  hand,  exploit  the  smoothing  capability  of  the  kernel  itself,  and 
do  not  require  explicit  inversion  formulas. 


7.4  The  Wicksell  Solution  to  the  Random  Sphere  Problem 


An argument in geometric probability leads to an Abel equation for the random sphere
problem. We follow here the presentation of the conditioning argument given by Nychka
et al. (1984).

Consider a single sphere of radius R, where R is a random variable with density f(r).
Condition on the radius R = r, and let this sphere be cut at random by a plane, i.e. let
the distance U from the cutting plane to the center of the sphere be uniform on [0, r],
and let the radius of a cross-sectional circle (a profile radius) be denoted by the random
variable X. Let E denote the event that a sphere cut by the plane has radius r. It is easy
to see that

\[ G_{X|E}(x) = P(X \le x \mid E) = P\left(U \ge \sqrt{r^2-x^2} \,\middle|\, E\right) =
\begin{cases}
1 - \sqrt{r^2-x^2}/r & \text{for } 0 \le x \le r \\
1 & \text{for } x > r \\
0 & \text{for } x < 0
\end{cases} \tag{7.10} \]

The probability that a sphere will be cut by a given plane depends on its radius, r.
Thus the conditional density of the radii of spheres given that they are cut by a specified
plane changes from f(r) to l(r), which is proportional to r f(r). To see this, consider an
infinite population of spheres intersecting a given plane. Replace each cut sphere by the
diameter which is orthogonal to the intersecting plane. The probability density that a cut
sphere has radius r is clearly the ratio of the total length of diameters of length 2r to the
total length of all diameters, i.e.

\[ l(r) = \frac{2r f(r)}{2\int_0^{r_0} z f(z)\,dz} = \frac{r f(r)}{\int_0^{r_0} z f(z)\,dz} = \frac{r f(r)}{\mu}, \tag{7.11} \]

where $\mu$ is defined to be the mean sphere radius, and $r_0$ is the largest observable profile radius.

Let F and G denote the cumulative distributions of sphere and profile radii, respectively,
and let $I_{\{A\}}(t)$ denote the indicator function (2.66) for the set A. We have

\[ G(x) = \frac{1}{\mu}\int_0^{r_0} \left[1 - I_{\{r > x\}}(r)\,\sqrt{r^2-x^2}/r\right] r f(r)\,dr
        = 1 - \frac{1}{\mu}\int_x^{r_0} \sqrt{r^2-x^2}\, f(r)\,dr. \tag{7.12} \]
Differentiating both sides of (7.12) with respect to x gives an Abel equation relating the
density of profile radii to the density of sphere radii:

\[ g(x) = \frac{1}{\mu}\int_x^{r_0} \frac{x f(r)\,dr}{\sqrt{r^2-x^2}}. \tag{7.13} \]

Because $\mu$ is the mean of a random variable (the sphere radius) having density f(x), the
equation (7.13) is actually nonlinear. We will see below, though, that this nonlinearity
does not introduce any serious difficulties.


7.5 The Conditional Expectation Algorithm for the Random Sphere Problem

We now apply the Conditional Expectation algorithm (4.16) to (7.13), and consider several
numerical examples. First we discuss how to transform (7.13) into an equation for which
the limits of integration are constant, and then we define the modified algorithm. We
begin the iteration with $f^0 = g$.

Let the mean of the sphere radius density $f^n(x)$ be

\[ \mu_n = \int_0^{r_0} x f^n(x)\,dx. \tag{7.14} \]

We now replace the functional $\mu$ in (7.13) with the constant $\mu_n$. The result is a linear
integral equation, and a Conditional Expectation algorithm step for this equation is

\[ h^n(x) = \left[\int_x^{r_0} \frac{x\,dr}{\sqrt{r^2-x^2}}\right]^{-1}
            \left[\mu_n g(x) - \int_x^{r_0} \frac{x f^n(r)\,dr}{\sqrt{r^2-x^2}}\right] \]
\[ \phantom{h^n(x)} = \left\{x\left[\log\left(\sqrt{r_0^2-x^2}+r_0\right) - \log(x)\right]\right\}^{-1}
            \left[\mu_n g(x) - \int_x^{r_0} \frac{x f^n(r)\,dr}{\sqrt{r^2-x^2}}\right]. \tag{7.15} \]
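
A single correction of the form (7.15) can be sketched in Python as follows. This is an illustration under stated assumptions, not the discretization used in the thesis: r0 is set to 1, the singular integral is evaluated with the substitution r = x cosh(u) and a midpoint rule (a convenience adopted here in place of the Gauss-Legendre scheme of Appendix A), and every function name is hypothetical. The test case uses the density f(r) = 6r(1 - r), with mean 1/2, and its profile density as quoted in Section 7.6; at the exact solution the correction should vanish.

```python
import math

def kernel_integral(f, x, r0=1.0, n=4000):
    """Approximate int_x^r0 x*f(r)/sqrt(r^2 - x^2) dr for 0 < x < r0.

    With r = x*cosh(u) the integrand becomes x*f(x*cosh(u)) du, which is
    smooth on [0, acosh(r0/x)], so a midpoint rule is adequate for a sketch."""
    u_max = math.acosh(r0 / x)
    step = u_max / n
    return x * step * sum(f(x * math.cosh((i + 0.5) * step)) for i in range(n))

def q(x, r0=1.0):
    """Integral of the kernel alone: x*(log(sqrt(r0^2 - x^2) + r0) - log(x))."""
    return x * (math.log(math.sqrt(r0 * r0 - x * x) + r0) - math.log(x))

def ce_correction(f, g, mu, x, r0=1.0):
    """h(x) = [mu*g(x) - int_x^r0 x*f(r)/sqrt(r^2 - x^2) dr] / q(x)."""
    return (mu * g(x) - kernel_integral(f, x, r0)) / q(x, r0)

def f_true(r):
    return 6.0 * r * (1.0 - r)

def g_true(x):
    # Profile radius density for f_true (closed form quoted in Section 7.6).
    return 6.0 * (x * math.sqrt(1.0 - x * x)
                  - x ** 3 * math.log(1.0 / x + math.sqrt(1.0 / (x * x) - 1.0)))

h_at_solution = ce_correction(f_true, g_true, 0.5, 0.4)    # should be close to zero
h_from_start = ce_correction(g_true, g_true, 0.4712, 0.4)  # starting guess f^0 = g
print(h_at_solution, h_from_start)
```

Dividing by q(x) is what makes the update a weighted average of the residual: with f identically 1 the kernel integral equals q(x) exactly, so the normalized kernel integrates to one in r for each x.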


Following the approach in Appendix A for discretizing Volterra equations (A.3), we
simplify the computation by changing the variable of integration so that the quadrature
points can be chosen independently of x. To do this, we make the change of variable to
w where

\[ r = (r_0 - x)w + x. \tag{7.16} \]

If the integral of the kernel is denoted

\[ q(x) = x\left\{\log\left(\sqrt{r_0^2-x^2}+r_0\right) - \log(x)\right\}, \tag{7.17} \]

then, after the change of variable (7.16),

\[ h^n(x) = \left[\mu_n g(x) - \int_0^1 \frac{x(r_0-x)\, f^n((r_0-x)w+x)\,dw}{\sqrt{(r_0-x)^2w^2 + 2wx(r_0-x)}}\right] \Big/ q(x). \tag{7.18} \]


We have transformed the Volterra equation (7.13) into a sequence of integral equations,
one for each $\mu_n$, all with kernel

\[ k(x,w) = \frac{x(r_0-x)}{q(x)\sqrt{(r_0-x)^2w^2 + 2wx(r_0-x)}}, \tag{7.19} \]

for $(x,w) \in [0,1] \times [0,1]$. This kernel has a singularity along the line $\{(x,w) \mid w = 0\}$.
Figure 7.1 is a plot of (7.19). Note that k(x,w) drops off very rapidly with increasing w
for each x and that, if $h^n$ is evaluated along the singular line, then

\[ h^n((r_0 - x)w + x)\big|_{w=0} = h^n(x). \tag{7.20} \]

We can, without loss of generality, let $r_0 = 1$, since this is equivalent to choosing a suitable
unit of length.

We now outline an approach, based on the Conditional Expectation algorithm, for
simultaneously approximating f and $\mu$. Let $f^n$ be a given sphere radius density, which
we will regard as an approximation to a solution f to (7.13). Corresponding to this $f^n$,
there is a mean sphere radius $\mu_n$ and a profile radius density $g^n$. We can easily determine
the product $\mu_n g^n$:

\[ \tilde{g}^n(x) = \mu_n g^n(x) = \int_x^1 \frac{x f^n(r)\,dr}{\sqrt{r^2-x^2}}. \tag{7.21} \]

We know that $g^n(x)$, a probability density, must integrate to one. The mean radius $\mu_n$ is
therefore the normalizing constant:

\[ \mu_n = \int_0^1 \left[\int_x^1 \frac{x f^n(r)\,dr}{\sqrt{r^2-x^2}}\right] dx. \tag{7.22} \]

From $f^n$ we determine, successively, $\tilde{g}^n$, $\mu_n$, $g^n$, and $h^n$ (where $h^n$ is given by (7.18)).
Now we can calculate $f^{n+1} = f^n + h^n$. Assume, for the moment, that $f^{n+1}$ is positive.
Since $f^{n+1}$ need not be a probability density, we let

\[ \hat{f}^{n+1}(x) = \frac{f^{n+1}(x)}{\int_0^1 f^{n+1}(y)\,dy}, \tag{7.23} \]

and continue the iteration.

If $f^{n+1}(x) < 0$ for some x values, the simplest thing to do is to replace $f^{n+1}$ with
$\max(f^{n+1}, 0)$ before performing the normalization (7.23). This approach is usually adequate,
and it has been followed in the examples of this chapter.
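
The truncation and renormalization (7.23) amount to only a few lines of code. The Python sketch below uses hypothetical names and assumes that the iterate is stored as values at quadrature nodes on [0, 1] with associated weights; a midpoint rule and a made-up iterate are used purely for illustration.

```python
def truncate_and_normalize(values, weights):
    """Replace negative values by zero, then rescale so that the quadrature
    approximation of the integral of the result equals one."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(w * v for w, v in zip(weights, clipped))
    if total <= 0.0:
        raise ValueError("iterate has no positive mass to normalize")
    return [v / total for v in clipped]

# Midpoint-rule nodes and weights on [0, 1] (an illustrative choice).
n = 10
nodes = [(i + 0.5) / n for i in range(n)]
weights = [1.0 / n] * n

# A made-up unnormalized iterate that dips below zero near both endpoints.
raw = [3.0 * x * (1.0 - x) - 0.2 for x in nodes]
density = truncate_and_normalize(raw, weights)
print(density)
```

Truncating before normalizing, rather than after, ensures that the quantity passed to the next iteration is a proper density with respect to the chosen quadrature rule.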

Another approach is to write $f = e^{\phi}$ and to replace (7.13) with the nonlinear equation

\[ g(x) = \frac{1}{\mu}\int_x^1 \frac{x\, e^{\phi(r)}\,dr}{\sqrt{r^2-x^2}}. \tag{7.24} \]


The Newton-step equation corresponding to (7.24) is easy to determine, and the
corresponding Conditional Expectation quasi-Newton step is


_j££x)_ 

fl  ’ 

Jx 


(7.25) 


which  cannot  be  negative  for  any  x. 

The nonlinear iteration based on (7.25) is closely related to the EM algorithm iteration
for this problem (Silverman et al., 1990), and to an iterative algorithm recently proposed
by Vardi (1992). However, the iterates from this nonlinear algorithm tend to be less
smooth than those from the Conditional Expectation algorithm, particularly where the
denominator in (7.25) is small.
7.5.1 Density Estimation Issues


In practice, we are almost never given a density g(x). Instead, we have profile radius
measurements, and the first order of business is to estimate their density. For those
situations where raw radius data are available, Taylor (1982) recommends using a (variable
bandwidth) Rosenblatt kernel estimator.

If only a histogram of profile radii is available, then this histogram, once normalized,
can be used as a piecewise constant estimate of g(x). Alternatively, one can interpolate
between the points with abscissas at the midpoints of the histogram intervals and ordinates
given by the normalized counts in the corresponding cells.

We discuss in some detail the use of a piecewise constant estimate of g. Let the
endpoints of the ith cell of the normalized histogram be $l_i < u_i$, for $i = 1, \ldots, m$, and
denote the estimate of g(x) for $x \in [l_i, u_i)$ by $g_i$. Then, from (7.13) we have, for each i,
that

\[ g_i \approx \frac{1}{(u_i - l_i)\mu}\int_{l_i}^{u_i}\int_x^1 \frac{x f(r)\,dr\,dx}{\sqrt{r^2-x^2}} = \int_0^1 k_i(r) f(r)\,dr, \tag{7.26} \]

where

\[ k_i(r) = \frac{1}{(u_i - l_i)\mu}
\begin{cases}
0 & \text{if } 0 \le r < l_i \\
\sqrt{r^2 - l_i^2} & \text{if } l_i \le r < u_i \\
\sqrt{r^2 - l_i^2} - \sqrt{r^2 - u_i^2} & \text{if } u_i \le r \le 1
\end{cases} \tag{7.27} \]

where we have set $r_0 = 1$. The kernel (7.27) is continuous in r but discrete in x. Although
this kernel is not singular, it does have a peak where $r = u_i$, and the Conditional
Expectation algorithm can still be successfully applied.


7.6  Numerical  Examples 

We begin by considering a simple example for which the solution is known:

\[ f_R(r) = f(r) = 6r(1-r), \tag{7.28} \]

where R is the sphere radius, and $\mu = E(R) = 1/2$. The corresponding profile radius
density, g(x), is (Anderssen and Jakeman, 1974, p. 136)

\[ g(x) = 6\left\{x\sqrt{1-x^2} - x^3\log\left[x^{-1} + \sqrt{x^{-2}-1}\right]\right\}. \tag{7.29} \]

We will consider the cases where g is observed with and without error; and for the case
where g is noisy, we will examine the effects of smoothing. We will also consider the case
where, instead of the right hand side being a function observed with error, we are given a
random sample of profile radii from the density (7.29). Following this, we will attempt to
invert (7.13) for a real data set on cross sections of liver cell nuclei (Keiding et al., 1972). The
computations were programmed in S (Becker, Chambers and Wilks, 1988), and the code
is included in Appendix E.
7.6.1 Sphere Radius Density f(r) = 6r(1 - r)

Let f(r) be given by (7.28) and let g(x) be given by (7.29), where g(x) might be observed
with error. We discretize this problem as in Appendix A by evaluating x and r each
at 25 Gauss-Legendre quadrature points and then we apply the Conditional Expectation
algorithm (4.16) (see the discussion in Section 7.5) for ten iterations.

We consider first the case where g is observed without error (except for computer
roundoff and discretization error), and no smoothing is performed. The results of ten
iterations of the algorithm for this situation are presented in Figure 7.2a-b. The heavy line
in Figure 7.2a is the true solution, and the heavy line in Figure 7.2b is the true right
hand side. Note that the algorithm converges rapidly to the solution to the problem, and
that the computations have apparently not been substantially affected by roundoff error.

Next, we introduce error in g. It turns out that, unless the noise level is low, some
smoothing of the profile radii is helpful. Within the S package, it is convenient to use
the function ‘smooth’, which is an implementation of the ‘4(3RSR)2H twice’ smoother
(Velleman and Hoaglin, 1981). Of course, for a real application, careful consideration
must  be  given  to  the  density  estimation  process.  However,  it  is  our  intention  in  this 
section  to  demonstrate  that  the  Conditional  Expectation  algorithm  is  potentially  useful 
for  certain  inverse  problems  in  applied  science,  so  we  will  not  be  concerned  much  with  the 
details  of  density  estimation. 

Let $Z_i$, $i = 1, \ldots, 25$, be iid standard normal random variables. At each value $g(x_i)$ of
g(x) in the discretized problem, we introduce relative error by the relationship

\[ \tilde{g}(x_i) = g(x_i)(1 + \epsilon Z_i). \tag{7.30} \]

We sometimes smooth the $\tilde{g}(x_i)$ by one pass of the ‘4(3RSR)2H twice’ smoother. The
Conditional Expectation algorithm is then applied with no further smoothing.
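
The relative-error model (7.30) is easy to reproduce. The short Python sketch below is illustrative only: the function name, the seed, and the ordinates are made up, and the ‘4(3RSR)2H twice’ smoother is deliberately not implemented here.

```python
import random

def add_relative_noise(values, eps, rng):
    """Return values[i] * (1 + eps * Z_i), with Z_i iid standard normal."""
    return [v * (1.0 + eps * rng.gauss(0.0, 1.0)) for v in values]

rng = random.Random(0)
g_values = [0.2, 0.9, 1.5, 1.1]      # made-up ordinates, for illustration only
noisy = add_relative_noise(g_values, 0.01, rng)
unchanged = add_relative_noise(g_values, 0.0, rng)
print(noisy)
```

Because the error is multiplicative, ordinates near zero are perturbed very little in absolute terms, so the tails of the profile density stay comparatively clean even at the larger noise levels.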

In Figures 7.3 and 7.4 we ‘roughen’ the profile radius measurements by using equation
(7.30) with $\epsilon = .001$ and $\epsilon = .01$ respectively. No smoothing was done, and it is obvious
from the results that no smoothing was necessary. Although noise levels within the range
considered here may seem small (in fact, the perturbed g(x) looks quite smooth to the
eye), it is worth noting that we have demonstrated that the inversion algorithm is useful
with only two significant digits of accuracy.

In Figures 7.5 and 7.6 we show the results from $\epsilon = .1$, both with and without smoothing
of g. In Figures 7.7 and 7.8 we examine the extreme case of $\epsilon = .25$, again with and
without smoothing. It is significant that even for these noisy examples, where the perturbed
g(x) is visibly rough, the algorithm performs reasonably well.

7.6.2 Sphere Radius Density f(r) = 6r(1 - r): Sampling from g(x)

It is straightforward to randomly sample from the density g(x). To do so, begin by
choosing a sphere radius at random from the density f(r); that is, from a Beta(2, 2)
density. Let this selected sphere radius be $r_j$. Next, choose a center for this sphere from
the uniform density on [0,1]; let the chosen center be $c_j$. Let the plane of the profile sections
be at 1. If $c_j + r_j > 1$, then the jth sphere has been cut, and we can proceed to select a
profile radius. If $c_j + r_j \le 1$, then the jth sphere was too far from the plane to be sectioned,
and the selected $r_j$ does not provide a profile radius. Let $d_j = c_j + r_j - 1$. If $d_j > 0$, then
simple geometry shows that the desired profile radius is $q_j = \sqrt{r_j^2 - (r_j - d_j)^2}$.
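
The sampling scheme just described can be sketched in a few lines of Python. This is an illustration rather than the S code of Appendix E: the function name, the seed, and the sample size are arbitrary choices made here, and only standard-library generators are used.

```python
import math
import random

def sample_profile_radii(n_spheres, rng):
    """Simulate profile radii by cutting random spheres with the plane at 1."""
    profiles = []
    for _ in range(n_spheres):
        r = rng.betavariate(2.0, 2.0)   # sphere radius with density 6r(1 - r)
        c = rng.random()                # sphere centre, uniform on [0, 1)
        d = c + r - 1.0                 # depth by which the sphere crosses the plane
        if d > 0.0:                     # the sphere is cut only when c + r > 1
            # r - d = 1 - c is the distance from the centre to the plane.
            profiles.append(math.sqrt(max(0.0, r * r - (r - d) ** 2)))
    return profiles

rng = random.Random(1992)
n_spheres = 20000
profiles = sample_profile_radii(n_spheres, rng)
fraction_cut = len(profiles) / n_spheres
mean_profile = sum(profiles) / len(profiles)
print(fraction_cut, mean_profile)
```

Given its radius r, a sphere is cut with probability r, so about half of the spheres should be cut (the mean of the Beta(2, 2) density is 1/2), and the cut spheres are automatically size-biased as in (7.11).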




For a numerical example, we selected 1000 sphere radii, of which 515 were cut by
the sectioning plane, resulting in 515 random draws from the density (7.29). A 50-cell
histogram of these 515 profile radii is given in Figure 7.9. We will explain below the solid
and broken lines in this figure. As a check, we have that the average of these 515 radii
is $\bar{x} = .4773$ with a standard error of .0092, which is less than one standard error
greater than

\[ \int_0^1 x g(x)\,dx = \int_0^1 6x\left\{x\sqrt{1-x^2} - x^3\log\left[x^{-1} + \sqrt{x^{-2}-1}\right]\right\} dx = .4712. \tag{7.31} \]
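
The value in (7.31) can be confirmed by direct quadrature of the closed form (7.29). The Python sketch below uses a plain midpoint rule with an arbitrary number of cells; the helper names are illustrative and are not from the thesis.

```python
import math

def g(x):
    """Profile radius density (7.29) for 0 < x < 1."""
    return 6.0 * (x * math.sqrt(1.0 - x * x)
                  - x ** 3 * math.log(1.0 / x + math.sqrt(1.0 / (x * x) - 1.0)))

def midpoint_integral(func, a, b, n=20000):
    """Midpoint rule; the endpoints, where g needs care, are never evaluated."""
    step = (b - a) / n
    return step * sum(func(a + (i + 0.5) * step) for i in range(n))

total_mass = midpoint_integral(g, 0.0, 1.0)                    # should be close to 1
mean_profile = midpoint_integral(lambda x: x * g(x), 0.0, 1.0)
print(total_mass, mean_profile)
```

The same routine also checks that (7.29) integrates to one, which is a quick way to catch a transcription error in the closed form.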


We  now  use  this  histogram  to  estimate  the  density  (7.29)  as  follows.  First  we  form 
two  vectors:  an  abscissa  vector  of  the  midpoints  of  the  histogram  cells  in  Figure  7.9, 
and  an  ordinate  vector  of  the  cell  counts.  These  two  vectors  determine  a  piecewise  linear 
function  which  we  take  to  be  our  right  hand  side  g.  Next,  we  evaluate  this  piecewise 
linear  interpolant  at  the  quadrature  points  for  25-point  Gauss-Legendre  quadrature.  We 
integrate  the  resulting  function,  and  normalize  it  so  as  to  provide  a  density  estimate. 
The  estimate  which  results  is  superposed,  suitably  scaled,  on  the  histogram  as  a  solid 
piecewise  linear  function.  The  broken  line  results  from  applying  one  pass  of  the  smoother 
‘4(3RSR)2H  twice’  to  the  solid  line.  We  will  refer  to  the  broken  line  as  a  ‘smoothed 
density  estimate’. 

In Figure 7.10a, successive approximations to the solution are displayed for the smoothed
density estimate in Figure 7.9. The results are disappointing; there is an unwanted peak
near x = 0. This difficulty, which does not occur in the real data example of the following
subsection, is not yet understood. Silverman et al. (1990) comment on the
same phenomenon when they treat this random sphere problem using a ‘smoothed EM’
approach. In Figure 7.10b we see evidence of the ill-posed nature of our problem: the
unreasonable approximation with the unwanted peak still gives a right hand side close to
the smoothed profile density estimate.


7.6.3  A  Real  Data  Example:  Liver  Cell  Nuclei 

We now consider a real example, taken from Keiding et al. (1972). The function g(x)
consists of smoothed midpoints of a histogram of liver cell nuclei profile radii (Keiding et al.,
1972, p. 823).

In  Figure  7.11,  this  histogram  is  displayed  along  with  the  estimate  of  f(r)  which 
the  Conditional  Expectation  algorithm  provides.  The  sphere  density  estimate  becomes 
slightly  negative  for  small  r;  it  is  truncated  to  zero  in  the  figure.  Of  course,  an  estimate 
of  a  profile  radius  density  obtained  from  real  data  need  not  correspond  to  any  sphere 
density  under  our  idealized  model.  Density  estimates  which  are  negative  in  places  are 
to  be  expected  with  real  data  from  any  algorithm  unless  the  algorithm  constrains  the 
solution  to  remain  positive.  Overall,  though,  the  sphere  radius  density  estimate  looks 
reasonable,  and  it  compares  favorably  with  estimates  for  this  (and  other)  datasets  in  the 
stereology literature.

Figure 7.1: Random Sphere Kernel After Change of Variable
Figure 7.2a: Approximations to Solution (epsilon = 0)
Figure 7.5a: Approximations to Solution (epsilon = .1, no smoothing)
Figure 7.7a: Approximations to Solution (epsilon^
Figure 7.8a: Approximations to Solution (epsilon = .25, smoothing)
Figure 7.9: A Sample from a Profile Radius Density

Chapter 8

Conclusions 


The  focus  of  this  thesis  has  been  on  a  simple  iterative  algorithm  for  integral  equations 
of  the  first  kind  with  positive  kernels,  which  we  have  called  the  Conditional  Expectation 
Algorithm.  A  study  of  a  numerical  algorithm  for  solving  deterministic  equations  might 
be regarded more as work in numerical analysis than as statistics. However, there are
numerous connections between this work and statistics, some of which are:

1.  The  first  instance  of  an  approximation  to  the  Conditional  Expectation  algorithm 
appears  in  the  attempt  of  Trickett  and  Welch  (1954)  to  find  a  similar  test  for  the 
Behrens-Fisher problem. The Trickett-Welch algorithm apparently converged to a
smooth ‘solution’ for a problem which has either none or only pathological solutions.

2.  The  Conditional  Expectation  algorithm  has  been  applied  successfully  to  a  difficult 
problem in one-sided β-content tolerance limits for a balanced one-way ANOVA
model,  a  problem  which  is  of  some  importance  in  engineering  statistics.  The  result 
is  a  new  method  (Vangel,  1992)  which  provides  tolerance  limits  with  confidence  level 
virtually  independent  of  nuisance  parameters.  Like  the  Behrens-Fisher  problem,  this 
problem  most  likely  has,  at  best,  pathological  exact  solutions. 

3. Many other applications to problems in mathematical statistics which can be
formulated as integral equations of the first kind are clearly possible.

4.  The  Conditional  Expectation  algorithm  has  been  shown  to  be  potentially  useful 
for  certain  inverse  problems  of  indirect  measurement.  This  usefulness  has  been 
demonstrated  by  means  of  a  classical  inverse  problem  of  stereology.  The  statistical 
analysis  of  inverse  problems  is  an  area  of  considerable  interest  to  statistics. 

5. The name ‘Conditional Expectation algorithm’ was chosen to emphasize the
probabilistic motivation for the method. We have introduced the notion of stochastic
preconditioning, a process which transforms any positive, bounded kernel into a
conditional density. Each step of the proposed algorithm then constitutes the
conditional expectation of the true discrepancy of a solution from the current iterate,
with respect to this density.
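
The stochastic preconditioning of item 5 can be illustrated with a minimal sketch, written in Python rather than S so that it can be run as it stands. It assumes a linear equation already discretized as a positive matrix K with entries k(x_i, y_j) w_j, and it assumes, as in the S listing of Appendix B, that the preconditioner divides each equation by the integral of the kernel with respect to y. The function names and the 2 x 2 test matrix are hypothetical.

```python
def stochastic_preconditioner(K):
    """Return 1 / (row sums of K); every entry of K is assumed positive.

    Dividing row i of K by its sum turns that row into a discrete
    conditional density in y given x_i.
    """
    return [1.0 / sum(row) for row in K]


def conditional_expectation_iterate(K, g, niter=200):
    """Iterate f_i <- f_i + (g_i - sum_j K[i][j] f_j) / sum_j K[i][j]."""
    u = stochastic_preconditioner(K)
    f = list(g)  # first approximation: the right-hand side itself
    for _ in range(niter):
        v = [sum(kij * fj for kij, fj in zip(row, f)) for row in K]
        f = [fi + ui * (gi - vi) for fi, ui, gi, vi in zip(f, u, g, v)]
    return f


# Hypothetical 2 x 2 discretized kernel (quadrature weights already folded in).
K = [[0.5, 0.25], [0.25, 0.5]]
f_true = [1.0, 2.0]
g = [sum(kij * fj for kij, fj in zip(row, f_true)) for row in K]
f_hat = conditional_expectation_iterate(K, g)
```

When g = K f* exactly, the correction in each step equals the average of f* - f under the normalized row of K, which is the conditional expectation described in item 5.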

In  Chapters  5-7  of  this  thesis  we  consider,  in  succession,  the  Behrens-Fisher  problem, 
a  random  effects  tolerance  limit  problem,  and  an  inverse  problem  in  stereology.  All  of 
these  problems  involve  solving  nonlinear  ill-posed  integral  equations  of  the  first  kind,  all 
are of statistical interest, and all are successfully treated by the Conditional Expectation
algorithm. This algorithm has also been useful in several other examples, but we have
chosen  not  to  report  on  them  here.  Instead,  we  have  attempted  to  explain  why  the 
remarkably  simple  algorithm  which  we  have  proposed  often  works  so  well  on  problems 
formulated  as  ill-posed  integral  equations. 

In  order  to  attempt  to  answer  this  question  we  have  made  a  long  detour  into  numerical 
and  functional  analysis,  with  some  interesting  results: 

1. Sufficient and, in some cases, necessary conditions for the convergence of
Richardson’s algorithm have been established for singular and nonsingular matrix
equations. These general theorems provide insight into why such
algorithms can still perform well when applied to inconsistent equations.

2.  Several  motivations  for  the  stochastic  preconditioning  which  leads  to  the  Conditional 
Expectation  algorithm  have  been  developed.  One  of  these  heuristic  motivations 
suggested  the  name  for  the  algorithm. 

3.  One  peculiarity  of  iterative  algorithms  applied  to  ill-posed  integral  equations  is  that 
these algorithms can produce smooth near-solutions without requiring explicit
regularization. There must be regularization implicit in iteration, and this regularization
arises  because  the  kernel  of  the  equation  tends  to  smooth.  We  have  made  this  idea 
precise  by  showing  that  each  step  in  a  linear  iteration  minimizes  a  quadratic  form, 
which  is  a  sum  of  two  terms.  The  first  term  measures  how  close  the  present  iterate 
is  to  the  vector  to  which  it  is  converging,  and  the  second  term  (for  discretizations 
of  integral  equations  with  smooth  kernels)  tends  to  penalize  ‘rough’  iterates. 

We  conclude  by  suggesting  three  directions  for  future  research.  The  first  is  into  the 
numerical  aspects  of  the  Conditional  Expectation  algorithm,  an  area  in  which  our  results 
have been incomplete. This is work for a numerical analyst. The other two areas are of
more statistical interest:

1. Apply the Conditional Expectation algorithm to other problems in statistical
methodology. These problems include similar tests and confidence intervals in normal
theory problems (such as, for example, confidence intervals for linear combinations
of  variance  components).  Another  promising  class  of  problems  in  empirical  Bayes 
methodology  concerns  estimating  a  prior  density  on  a  parameter  given  an  estimate 
of  the  marginal  density  of  the  data. 

2.  Explore  the  role  of  the  Conditional  Expectation  algorithm  in  other  applied  inverse 
problems,  perhaps  in  image  processing. 

This  thesis  had  its  origin  in  1985,  with  a  chance  encounter  with  Trickett  and  Welch 
(1954),  an  article  virtually  ignored  in  the  statistical  literature.  Some  points  of  interest 
along  the  path  followed  since  then  are  summarized  in  this  document.  There  is  much  more 
to be seen; a fundamental understanding of how the Trickett-Welch and related algorithms
work their magic is still lacking. Perhaps others will take up the trail.

Appendix A

A  Setup  for  Numerical  Problems 


To  iteratively  approximate  the  solution  of  an  integral  equation  numerically,  we  replace 
the  integral  equation  by  an  approximating  system  of  algebraic  equations.  We  then  apply 
a  version  of  Richardson’s  algorithm  to  the  solution  of  this  system  of  equations. 

All integral equations which we will consider are special cases of

$$\int_0^1 k\{x, y, f[\phi(x,y)]\}\,dy = g(x), \tag{A.1}$$

where $k$, $g$, and $\phi : [0,1] \times [0,1] \to [0,1]$ are known functions. By choosing a mesh of $x$
and $y$ values, a quadrature rule, and an interpolation rule for $f$, equation (A.1) can be
replaced with a system of nonlinear equations.

Note that (A.1) allows for the possibility that the kernel, $k$, is nonlinear in the
unknown, $f$. When this is the case, we will linearize this equation, thus reducing the
numerical problem to one of approximately solving a linear integral equation, with a different
kernel, at each iteration.

We discuss first a numerical setup suitable for the general equation (A.1), and then
we consider important special cases. Although we present this discretization scheme for
an integral equation on the unit square, extension to other regions is straightforward.


A.1 The General Setup

Let $\{y_j\}_{j=1}^{c}$ be quadrature abscissas, and let $\{w_j\}_{j=1}^{c}$ denote the quadrature weights. We
choose $r$ points at which the unknown function, $f$, is to be determined, and we let these
points be $\{x_i\}_{i=1}^{r}$. We define the $x_i$ and $y_j$ values to be ordered:

$$0 \le x_1 < x_2 < \cdots < x_r \le 1$$

and

$$0 \le y_1 < y_2 < \cdots < y_c \le 1.$$

We will use iterative algorithms to solve $r$ equations in $r$ unknowns. Unless $\phi(x,y) = y$,
$c = r$, and $x_i = y_i$ for $i = 1, \ldots, r$, it is necessary to specify a function $\tilde f$ which
approximates $f$ by interpolation, and possibly also by extrapolation. We will give the
details of $\tilde f$ defined by linear interpolation, and we will use this definition exclusively in
numerical examples.

For any point $z^* = \phi(x_i, y_j)$ such that $x_l \le z^* < x_{l+1}$, let

$$\tilde f(z^*) = f(x_l) + \frac{f(x_{l+1}) - f(x_l)}{x_{l+1} - x_l}\,(z^* - x_l). \tag{A.2}$$

If $z^* < x_1$ or $z^* > x_r$, then it is necessary to specify an extrapolation rule. This will
usually be required if $y_1 < x_1$ or $y_c > x_r$. Let $x_0 = 0 \le x_1$ and $x_{r+1} = 1 \ge x_r$. If $x_0 \ne x_1$,
assume $f(0)$ is known. Similarly, if $x_{r+1} \ne x_r$, assume $f(1)$ is known. This situation does
not arise in this thesis; if it did one could choose a quadrature rule which did not require
$f(0)$ or $f(1)$.

For $z^* < x_1$, define

$$\tilde f(z^*) = f(x_1) + \frac{f(x_1) - f(x_0)}{x_1}\,(z^* - x_1), \tag{A.3}$$

and for $z^* > x_r$, let

$$\tilde f(z^*) = f(x_r) + \frac{f(x_{r+1}) - f(x_r)}{1 - x_r}\,(z^* - x_r). \tag{A.4}$$
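
The interpolation and extrapolation rules (A.2)-(A.4) can be sketched in a few lines. The following Python function is an illustration only; the name f_tilde, its argument names, and the requirement of at least two nodes are assumptions made for this sketch.

```python
from bisect import bisect_right


def f_tilde(z, xs, fs, f0=None, f1=None):
    """Piecewise-linear value at z from nodal values fs at ordered nodes xs.

    f0 and f1 stand for the assumed-known values f(0) and f(1); they are
    needed only when z falls outside [xs[0], xs[-1]]. At least two nodes
    are assumed.
    """
    if z < xs[0]:
        # Extrapolation toward x_0 = 0, as in (A.3).
        return fs[0] + (fs[0] - f0) / xs[0] * (z - xs[0])
    if z > xs[-1]:
        # Extrapolation toward x_{r+1} = 1, as in (A.4).
        return fs[-1] + (f1 - fs[-1]) / (1.0 - xs[-1]) * (z - xs[-1])
    # Interior: locate l with xs[l] <= z < xs[l+1], as in (A.2).
    l = min(bisect_right(xs, z) - 1, len(xs) - 2)
    slope = (fs[l + 1] - fs[l]) / (xs[l + 1] - xs[l])
    return fs[l] + slope * (z - xs[l])


# Nodal values of the straight line f(x) = 2x + 1, which the rule reproduces.
xs = [0.2, 0.5, 0.9]
fs = [2.0 * x + 1.0 for x in xs]
values = [f_tilde(z, xs, fs, f0=1.0, f1=3.0) for z in (0.1, 0.35, 0.95)]
```

Because every branch is linear in the nodal values, this map from the nodal values to the interpolated value is linear.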


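When discussing discrete approximations to integral equations, we will let $f$ and $g$
denote the $r$-dimensional vectors

$$f = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_r) \end{bmatrix} \tag{A.5}$$

and

$$g = \begin{bmatrix} g(x_1) \\ g(x_2) \\ \vdots \\ g(x_r) \end{bmatrix}. \tag{A.6}$$

We define $\tilde f$ to be the $rc$-dimensional vector

$$\tilde f = \begin{bmatrix} \tilde f[\phi(x_1, y_1)] \\ \tilde f[\phi(x_1, y_2)] \\ \vdots \\ \tilde f[\phi(x_r, y_c)] \end{bmatrix}, \tag{A.7}$$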

where the function $\tilde f(z)$ is given by the formulas (A.2)-(A.4) in terms of the elements of
the vector $f$. All of the elements of $\tilde f$ need not be distinct. For example, if $\phi(x, y) = y$
and $r = c$, then there will only be $r$ distinct elements in $\tilde f$.

There exists an $rc \times r$ matrix $M$, which is implicitly defined by (A.2)-(A.4), such that

$$\tilde f = M f, \tag{A.8}$$

but we will not make any explicit use of this matrix.

We can now replace the equation (A.1) by the approximating system of equations

$$g_i(\tilde f) = \sum_{j=1}^{c} k\{x_i, y_j, \tilde f[\phi(x_i, y_j)]\}\, w_j = g_i, \tag{A.9}$$

for $i = 1, \ldots, r$.

In numerical examples, we will restrict attention to the case $r = c$, $y_i = x_i$, and
Gauss-Legendre quadrature, wherever possible.

A preconditioned Richardson algorithm for the system of equations (A.9) is

$$\underset{r \times 1}{f^{n+1}} = \underset{r \times 1}{f^{n}} + \theta \underset{r \times r}{D} \left[ \underset{r \times 1}{g} - \underset{r \times 1}{g(\tilde f^{n})} \right], \tag{A.10}$$

where the dimensions of the matrices are as indicated, $\theta$ is a positive constant, and $D$ is
a nonsingular matrix. We would like to choose $D$ to accelerate convergence.

The reason why the matrix $M$ in (A.8) is never needed should now be clear. The
difference $f^{n+1} - f^{n}$ involves $\tilde f^{n}$ only through the residual $g - g(\tilde f^{n})$.
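
To make the last point concrete, here is a minimal Python sketch of an iteration of the form (A.10) in which the unknown enters only through a user-supplied function returning the discretized left-hand side. The function name, the argument names, and the small linear test problem are hypothetical; the step size 0.3 is chosen only so that this particular example converges.

```python
def richardson(g_of_f, g, f_start, theta, D=None, niter=200):
    """Iterate f <- f + theta * D * (g - g_of_f(f)).

    g_of_f maps the current nodal values to the discretized left-hand
    side; D is an optional preconditioning matrix given as a list of
    rows (identity if omitted).
    """
    f = list(f_start)
    r = len(f)
    for _ in range(niter):
        gf = g_of_f(f)
        res = [g[i] - gf[i] for i in range(r)]  # residual g - g(f)
        if D is not None:
            res = [sum(D[i][j] * res[j] for j in range(r)) for i in range(r)]
        f = [f[i] + theta * res[i] for i in range(r)]
    return f


# Hypothetical linear test problem: g_of_f(f) = A f with a known solution.
A = [[2.0, 1.0], [1.0, 2.0]]


def apply_A(f):
    return [sum(a * fj for a, fj in zip(row, f)) for row in A]


g = apply_A([1.0, 2.0])
f_hat = richardson(apply_A, g, f_start=[0.0, 0.0], theta=0.3)
```

Nothing in the loop requires a matrix representation of g_of_f, so the same routine applies when the left-hand side is evaluated through interpolation or is nonlinear in the unknown.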

If the elements of $g(\tilde f)$ are linear in the elements of $f$, that is, if the interpolation rule
is linear and if the kernel, $k(x, y, f)$, is linear in the function $f$, then it is always possible
to write

$$g(\tilde f) = K f, \tag{A.11}$$

for some $r \times r$ matrix $K$, although it can, in general, be tedious to determine the elements
of $K$. If $g$ is nonlinear in $f$, then we will linearize the system of equations (A.9) at each
iteration.

Important  special  cases  of  this  general  setup  are  discretizations  for  linear  Fredholm 
and  linear  Volterra  equations.  We  discuss  next  the  specific  discretizations  used  for  all 
linear  Fredholm  and  Volterra  numerical  examples  in  this  thesis. 


A.2 A Linear Fredholm Example

Consider the Fredholm equation of the first kind

$$\int_0^1 k(x, y) f(y)\,dy = g(x). \tag{A.12}$$

We will use the setup of Section A.1, with $r = c$ and $x_i = y_i$ for $i = 1, \ldots, r$. Let $\{x_i\}_{i=1}^{r}$ be
the abscissa values for Gauss-Legendre quadrature, and let $\{w_j\}_{j=1}^{r}$ be the corresponding
weights. Define the typical elements of the $r$-dimensional vectors $f$ and $g$ by

$$f_i = f(x_i) \tag{A.13}$$

and

$$g_i = g(x_i), \tag{A.14}$$

respectively. Note that $\tilde f$ has at most $r$ distinct values. The $i$th element of $g(\tilde f)$ is

$$g_i(\tilde f) = \sum_{j=1}^{c} k(x_i, y_j)\,\tilde f_{(i-1)c+j}\,w_j = \sum_{j=1}^{c} k(x_i, y_j)\,f_j\,w_j, \tag{A.15}$$

for $i = 1, \ldots, r$. We can write, for this example, $Kf = g$, where the typical element of $K$
is

$$K_{ij} = k(x_i, y_j)\,w_j. \tag{A.16}$$

A preconditioned Richardson algorithm, for this discretization of a linear Fredholm
equation, is then of the form (3.3).
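
As an illustration of (A.15) and (A.16), the following Python sketch assembles the matrix K for a three-point Gauss-Legendre rule on [0, 1]. The three-point rule is used here only to keep the example short (the numerical examples in this thesis use more points); the function names are hypothetical, and the kernel shown is the Green's function kernel of the S listing in Appendix B.

```python
import math

# Three-point Gauss-Legendre rule on [0, 1] (standard tabulated values).
_a = 0.5 * math.sqrt(3.0 / 5.0)
X = [0.5 - _a, 0.5, 0.5 + _a]
W = [5.0 / 18.0, 8.0 / 18.0, 5.0 / 18.0]


def fredholm_matrix(kernel, x, w):
    """K[i][j] = k(x_i, y_j) * w_j with y_j = x_j, as in (A.16)."""
    return [[kernel(xi, yj) * wj for yj, wj in zip(x, w)] for xi in x]


def matvec(K, f):
    """Discretized left-hand side: element i is sum_j K[i][j] * f[j]."""
    return [sum(kij * fj for kij, fj in zip(row, f)) for row in K]


def green(x, y):
    """Green's function kernel: y (1 - x) for y < x, x (1 - y) otherwise."""
    return y * (1.0 - x) if y < x else x * (1.0 - y)


K = fredholm_matrix(green, X, W)
g_approx = matvec(K, [1.0, 1.0, 1.0])  # approximates x (1 - x) / 2 for f = 1
```

For kernels that are smooth in y the quadrature error is small even with few points; the kink of the Green's function kernel along y = x is one reason to use more points in practice.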


A.3 A Linear Volterra Example

Consider the Volterra equation of the first kind

$$\int_0^x k(x, u) f(u)\,du = g(x). \tag{A.17}$$

Since the upper limit of integration of (A.17) is not constant, (A.17) is not in the general
form of (A.1). However, the change of variable

$$u = xy \tag{A.18}$$

results in the following integral equation with fixed limits of integration:

$$\int_0^1 x\,k(x, xy) f(xy)\,dy = g(x). \tag{A.19}$$

If we let

$$\phi(x, y) = xy \tag{A.20}$$

and

$$k\{x, y, f[\phi(x, y)]\} = x\,k(x, xy) f(xy), \tag{A.21}$$

then (A.19) is a special case of (A.1).

As with the Fredholm example, we will use the setup of Section A.1, with $r = c$ and
$x_i = y_i$ for $i = 1, \ldots, r$. Let $\{x_i\}_{i=1}^{r}$ be the abscissa values for Gauss-Legendre quadrature,
and let $\{w_i\}_{i=1}^{r}$ be the corresponding weights. Define the $i$th element of the $r$-dimensional
vector $g$ by

$$g_i = g(x_i). \tag{A.22}$$

We will assume $f(0)$ is known, and define the $r^2 \times 1$ vector $\tilde f$ by equations (A.2)-(A.4).
The elements of $g(\tilde f)$ are given by

$$g_i(\tilde f) = \sum_{j=1}^{c} x_i\,k(x_i, x_i y_j)\,\tilde f(x_i y_j)\,w_j, \tag{A.23}$$

for $i = 1, \ldots, r$. Although $\{g_i(\tilde f) = g_i\}_{i=1}^{r}$, where $g_i(\tilde f)$ is given by (A.23), is a system of $r$
linear equations in $r$ unknowns, it is not easy, and certainly not necessary, to find a matrix
$K$ so that this system can be written in the form $Kf = g$. Richardson’s algorithm can be
used directly, in the form (A.10).
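
The matrix-free evaluation of (A.23) can be sketched as follows. This Python illustration again uses a three-point Gauss-Legendre rule on [0, 1] for brevity, and all function names are hypothetical. Since every point x_i y_j is smaller than x_r, only the extrapolation toward zero in (A.3) is ever needed, which is why f(0) alone must be supplied.

```python
import math
from bisect import bisect_right

# Three-point Gauss-Legendre rule on [0, 1] (standard tabulated values).
_a = 0.5 * math.sqrt(3.0 / 5.0)
X = [0.5 - _a, 0.5, 0.5 + _a]
W = [5.0 / 18.0, 8.0 / 18.0, 5.0 / 18.0]


def interp(z, xs, fs, f0):
    """Linear interpolation of nodal values fs; uses f(0) = f0 below xs[0]."""
    if z < xs[0]:
        return fs[0] + (fs[0] - f0) / xs[0] * (z - xs[0])
    l = min(bisect_right(xs, z) - 1, len(xs) - 2)
    return fs[l] + (fs[l + 1] - fs[l]) / (xs[l + 1] - xs[l]) * (z - xs[l])


def volterra_lhs(kernel, fs, f0, xs=X, ws=W):
    """Element i is sum_j x_i k(x_i, x_i y_j) f~(x_i y_j) w_j, with y_j = x_j."""
    return [
        sum(x * kernel(x, x * y) * interp(x * y, xs, fs, f0) * w
            for y, w in zip(xs, ws))
        for x in xs
    ]


# Check on k = 1 and f(u) = u, for which the exact right-hand side is x^2 / 2.
g = volterra_lhs(lambda x, u: 1.0, fs=list(X), f0=0.0)
```

A function of this kind can be passed directly to a Richardson iteration of the form (A.10), since that iteration needs only the residual.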

The  approach  to  the  discretization  of  Volterra  equations  presented  in  this  subsection 
will  be  used  exclusively  for  the  Volterra  numerical  examples  in  this  thesis.  However,  it 
is  important  to  note  that  by  transforming  a  Volterra  kernel  (on  a  triangular  domain) 
into  a  kernel  on  the  unit  square,  we  have  violated  a  notion  of  causality  inherent  in  the 
Volterra equation. The value of the right hand side $g(x)$ depends only on $\{f(y) \mid y \le x\}$, and
conversely, the solution $f(y)$ depends only on $\{g(x) \mid x \le y\}$. Our numerical treatment does
not  preserve  this  property.  In  this  sense  what  we  are  doing  is  unconventional.  Although 
it  has  worked  well  for  the  examples  considered,  we  are  not  in  a  position,  at  this  time,  to 
advocate it as a general approach.

Appendix B

S  Code  for  Conditional 
Expectation  Algorithm  and 
Richardson  Algorithm  for  the 
Green’s  Function  Kernel 


The following function, written in the S programming language (Becker, Chambers and
Wilks, 1988), can be used for solving the integral equation with kernel (4.30). The function
gauleg calls a FORTRAN routine to determine Gauss-Legendre quadrature abscissas and
weights, and is documented in Appendix C.

Note  that  to  solve  other  linear  integral  equations  on  the  unit  square,  only  the  functions 
kernel  and  rhs  need  to  be  modified.  This  is  therefore  a  very  useful  function  for  exploring 
iterative  algorithms  for  linear  integral  equations. 

#
# Mark Vangel Sept 1990
#
# Richardson and Conditional Expectation algorithms for Green's function
# example.
#
inteqn <- function(niter=10, npt=25, xl=0, xh=1, norm=T, fct=1,
                   jac=F, land=F){
#
# -- Matrices of successive corrections, sums of corrections,
#    and approximate r.h.s
  h <- matrix (0, niter, npt)
  f <- h
  v <- h
#
# -- Gauss-Legendre abscissas and weights
  gl <- gauleg (npt, xl, xh)
#
# -- Arrays of x and y values at which functions are evaluated
  x <- matrix (gl$x, npt, npt, byrow=T)
# x <- (-6*x**3+9*x**2-x)/2
  y <- matrix (gl$x, npt, npt)
#
# -- Array of weights for numerically integrating kernel w.r.t. y
  wy <- matrix (gl$w, npt, npt)
#
# -- Kernel, integral of kernel w.r.t. y, and rhs
  k <- kernel (x, y) *wy
  u <- 1 /apply (k, 2, 'sum')
  if (norm == F) u <- fct*u/u
  if (jac == T) u <- 1/diag(kernel(x,y))
  g <- rhs (x[1,])
  if (land == T){
    g <- t(k) %*% g
    k <- (t(k) %*% kernel(x,y)) *wy}
#
# -- First approximation to solution
  f[1,] <- g
#
# -- Now calculate the successive corrections
  for (i in 1:(niter-1)) {
    v[i,] <- t(k) %*% f[i,]
    h[i,] <- u * (g -v[i,])
    f[i+1,] <- f[i,]+h[i,]
  }
  inteqn <- list (h, f, v, g, kernel(x,y))
  names(inteqn) <- c('h', 'f', 'g', 'rhs', 'k')
  return (inteqn)
}
#
# Calculate kernel at gauss points
#
kernel <- function (x, y) {
  i <- y<x
  kernel <- i *y *(1-x) +(1-i) *x *(1-y)
  return (kernel)
}
#
# Right hand side of equation
#
rhs <- function(x) {
  rhs <- x**3*(1-x)**2
  return (rhs)
}

Appendix C


S Code for the Trickett-Welch
Algorithm and Conditional
Expectation Algorithms for the
Behrens-Fisher Problem


C.1 The Trickett-Welch Algorithm


The following function is written in the S programming language (Becker, Chambers and
Wilks, 1988). The only required call to an external FORTRAN routine is the function
gauleg, which calculates Gauss-Legendre quadrature points and weights. This FORTRAN
routine was obtained from Press et al. (1986, pp. 125-126), and is not reproduced
here. The FORTRAN routines mydbeta, mydt and mypt are double precision versions of
the single precision S functions dbeta, dt and pt respectively. These FORTRAN functions
were obtained from Griffiths and Hill (1985), but satisfactory results for most applications
can probably be obtained from the corresponding S functions.

#
# Mark Vangel May 1991
#
# Solve Behrens-Fisher integral equation
# (Algorithm as in Trickett-Welch (1954))
#
bftw <- function(niter=10, npt=25, nquad=25, nspl=100,
                 conf=.95, initf=qnorm(conf), n=c(5,5)) {
# -- Matrices of successive corrections; sums of corrections;
#    and approximate r.h.s.
  h <- matrix (0, niter, npt)
  f <- h
  v <- h
  r <- h
#
# -- d.f. of variance estimates; r.h.s for equation; initial guess for
#    critical value.
  df <- n-1
  g <- matrix(conf, npt, 1)
  f[1,] <- initf
  h[1,] <- 0
#
# -- Gauss-Legendre abscissas and weights; beta density evaluated at
#    abscissas.
  gl <- gauleg (nquad, 0, 1)
  b <- mydbeta (gl$x, df[1]/2, df[2]/2)
#
# -- Arrays of x, y and beta values at which functions are evaluated.
#    The beta values are weighted by the quadrature weights.
  x <- matrix ((0:(npt-1))/(npt-1), nquad, npt, byrow=T)
  b <- matrix (b, nquad, npt) *matrix (gl$w, nquad, npt)
  y <- matrix (gl$x, nquad, npt)
#
# -- Argument for t density and cdf.
  arg <- sqrt((df[1]+df[2]) *(x*y/df[1]+(1-x)*(1-y)/df[2]))
#
# -- Argument for critical value statistic.
  z <- x*y/df[1] /(x*y/df[1] +(1-x)*(1-y)/df[2])
#
# -- Evaluate the gradient at the mean of the beta density
  xpeak <- df[1]/(df[1]+df[2])
  ipeak <- sum (gl$x < xpeak)
  w <- gl$x[ipeak+1] -xpeak
#
# -- Now calculate the successive corrections
  for (i in 1:(niter-1)) {
    sp <- spline(x[1,], f[i,], n=nspl)
    mf <- approx(sp$x, sp$y, c(z))$y
    mf <- matrix(mf, nquad, npt)
    k0 <- mypt (arg *mf, df[1]+df[2]) *b
    k1 <- mydt (arg *mf, df[1]+df[2])
    drv <- k1[ipeak,]*(1-w) +k1[ipeak+1,]*w
    v[i,] <- apply (k0, 2, 'sum')
    h[i+1,] <- -(v[i,]-g)/drv
    f[i+1,] <- f[i,] +h[i+1,]
  }
  r <- g -v[niter-1,]
  bf <- list (x[1,], h, f, v, k1, r)
  names(bf) <- c('x', 'h', 'f', 'g', 'k1', 'r')
  return (bf)
}

gauleg <- function (n, xlow=-1, xhgh=1) {
#
# Mark Vangel, Sept 1990
#
# Calculate Gauss-Legendre abscissas and weights
#
  x <- matrix(0,n,1)
  w <- x
  z <- .Fortran ("gauleg", as.double (xlow),
                 as.double (xhgh),
                 x = as.double (x),
                 w = as.double (w),
                 as.integer (n))
  z <- list(z$x, z$w)
  names (z) <- c('x','w')
  gauleg <- z
}
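
Since the FORTRAN routine behind gauleg is not reproduced, the following pure Python sketch shows one standard way to compute the same quantities: Newton's method on the Legendre polynomial, evaluated by its three-term recurrence. It is an illustration under that assumption and not a transcription of the routine from Press et al.; the argument names simply mirror the S wrapper above.

```python
import math


def gauleg(n, xlow=-1.0, xhgh=1.0):
    """Return (x, w): n-point Gauss-Legendre abscissas and weights on [xlow, xhgh]."""
    x = [0.0] * n
    w = [0.0] * n
    mid = 0.5 * (xhgh + xlow)
    half = 0.5 * (xhgh - xlow)
    for i in range((n + 1) // 2):
        # Starting guess for the i-th largest root of the Legendre polynomial P_n.
        z = math.cos(math.pi * (i + 0.75) / (n + 0.5))
        for _ in range(100):
            p_prev, p = 1.0, z  # P_0(z) and P_1(z)
            for k in range(2, n + 1):
                # Recurrence: k P_k = (2k - 1) z P_{k-1} - (k - 1) P_{k-2}.
                p_prev, p = p, ((2 * k - 1) * z * p - (k - 1) * p_prev) / k
            dp = n * (z * p - p_prev) / (z * z - 1.0)  # derivative P_n'(z)
            step = p / dp
            z -= step  # Newton update
            if abs(step) < 1e-14:
                break
        # Roots come in symmetric pairs; map them from [-1, 1] to [xlow, xhgh].
        x[i] = mid - half * z
        x[n - 1 - i] = mid + half * z
        w[i] = w[n - 1 - i] = 2.0 * half / ((1.0 - z * z) * dp * dp)
    return x, w


x25, w25 = gauleg(25, 0.0, 1.0)
```

The abscissas are returned in increasing order, and the weights sum to the length of the interval.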

C.2 The Conditional Expectation Algorithm


The following function is written in the S programming language (Becker, Chambers
and Wilks, 1988). For information on the use of external FORTRAN routines, see the
comments preceding the code in Appendix C.1.

#
# Mark Vangel. Jan 1991
#
# Solve Behrens-Fisher integral equation
#
bf <- function(niter=10, npt=25, nquad=25, nspl=100,
               conf=.95, initf=qnorm(conf), n=c(5,5)) {
#
# -- Matrices of successive corrections; sums of corrections;
#    and approximate r.h.s.
  h <- matrix (0, niter, npt)
  f <- h
  v <- h
  r <- h
#
# -- d.f. of variance estimates; r.h.s for equation; initial guess for
#    critical value.
  df <- n-1
  g <- matrix(conf, npt, 1)
  f[1,] <- initf
  h[1,] <- 0
#
# -- Gauss-Legendre abscissas and weights; beta density evaluated at
#    abscissas.
  gl <- gauleg (nquad, 0, 1)
  b <- mydbeta (gl$x, df[1]/2, df[2]/2)
#
# -- Arrays of x, y and beta values at which functions are evaluated.
#    The beta values are weighted by the quadrature weights.
  x <- matrix ((0:(npt-1))/(npt-1), nquad, npt, byrow=T)
  b <- matrix (b, nquad, npt) *matrix (gl$w, nquad, npt)
  y <- matrix (gl$x, nquad, npt)
#
# -- Argument for t density and cdf.
  arg <- sqrt((df[1]+df[2]) *(x*y/df[1] +(1-x)*(1-y)/df[2]))
#
# -- Argument for critical value statistic.
  z <- x*y/df[1] /(x*y/df[1] +(1-x)*(1-y)/df[2])
#
# -- Now calculate the successive corrections
  for (i in 1:(niter-1)) {
    sp <- spline(x[1,], f[i,], n=nspl)
    mf <- approx(sp$x, sp$y, c(z))$y
    mf <- matrix(mf, nquad, npt)
    k0 <- mypt (arg *mf, df[1]+df[2]) *b
    k1 <- mydt (arg *mf, df[1]+df[2]) *b *arg
    drv <- apply (k1, 2, 'sum')
    v[i,] <- apply (k0, 2, 'sum')
    h[i+1,] <- -(v[i,]-g)/drv
    f[i+1,] <- f[i,] +h[i+1,]
  }
  r <- g -v[niter-1,]
  bf <- list (x[1,], h, f, v, k1, r)
  names(bf) <- c('x', 'h', 'f', 'g', 'k1', 'r')
  return (bf)
}

Appendix D


S Code for the Conditional
Expectation Algorithm for the
Tolerance Limit Problem


The following function is written in the S programming language (Becker, Chambers and
Wilks, 1988). The function gauleg calls a FORTRAN routine as documented in Appendix
C. The function dtnc calls a FORTRAN subroutine to determine the noncentral-t density
and cumulative distribution function (Lenth, 1988).
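
For reference, the dtnc listing later in this appendix obtains the density from two evaluations of the cumulative distribution function. The standard identity that this computation appears to rely on is stated below; the symbols $\nu$ (degrees of freedom), $\delta$ (noncentrality parameter), $f_\nu$ (density) and $F_\nu$ (cumulative distribution function) are notation introduced for this note and are not taken from the thesis.

```latex
$$f_{\nu}(t;\delta) = \frac{\nu}{t}\left[ F_{\nu+2}\!\left(t\sqrt{\frac{\nu+2}{\nu}};\,\delta\right) - F_{\nu}(t;\delta) \right], \qquad t \neq 0.$$
```

In the listing, the two cumulative values correspond to the two calls to the FORTRAN routine, with df+2 and df degrees of freedom respectively.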

#
# Mark Vangel, September 1990
#
# Function to determine critical values for tolerance limit problem.
#
tw <- function (niter=10, i=5, j=5, npt=25, nquad=25, p=.9, g=.95,
                k0=welch (r, p, g, i, j, accel=F),
                kfact=3.406632){
#
# -- Degrees of freedom between, within, total
  df1 <- i-1
  df2 <- i*(j-1)
  df <- df1+df2
#
# -- Matrices of successive corrections; sums of corrections;
#    and approximate r.h.s.
  h <- matrix (0, niter, npt)
  f <- h
  v <- h
#
# -- Normal quantile, beta function.
  z <- qnorm (p)
  con <- lgamma((df1+df2)/2) -lgamma(df1/2) -lgamma(df2/2)
#
# -- Gauss-Legendre abscissas and weights
  gpt <- gauleg(nquad, 0, 1)
  x <- matrix(gpt$x, npt, nquad, byrow=T)
#
# -- Nuisance parameter values
  r <- gl$x/(1-gl$x)
  tau <- matrix(r, npt, nquad) +1
#
# -- Limiting value for f
  klim <- kfact
  f[1,] <- k0
#
# -- Iterate quasi-Newton algorithm
  for (i in 1:niter) {
#
# -- Determine maximum value of argument of f
    xm <- x[1,npt]
    tm <- tau[npt,1]
    rm <- tm *df2 *xm /(df1 *(1-xm))
    rlim <- rm
#
# -- Interpolate f
    msr <- tau *df2 *x /(df1 *(1-x))
    k <- array(approx (c(r,rlim),
                       c(f[i,],klim),
                       c(msr), rule=2)$y, dim(msr))
#
# -- Evaluate V_1 and find the peak
    arg <- k *sqrt ((df1+df2)*(i*x/(i-1) +(1-x)/tau))
    ncp <- z *sqrt (i*(1 +(j-1)/tau))
    u <- dtnc(arg, ncp, df1+df2)
    beta <- exp(con +(df1/2-1)*log(x) +(df2/2-1)*log(1-x))
    browser ()
    v1 <- beta *u$dens *arg /k
    v[i,] <- (beta *u$cdf) %*% gpt$w
    vmax <- apply(v1,1,'max')
    vmax <- apply(v1<=vmax, 1, 'sum')
    xmax <- x[10,vmax[10]]
    browser ()
#
# -- Transform the nuisance parameter estimate
    tau2 <- tau *df1 *(1-xmax)/(df2 *xmax)
#
# -- Determine new maximum value of argument of f
    xm <- x[1,npt]
    tm <- tau2[npt,1]
    rm <- tm *df2 *xm /(df1 *(1-xm))
    rlim <- rm
#
# -- Interpolate f again
    msr <- tau2 *df2 *x /(df1 *(1-x))
    k <- array(approx (c(r,rlim),
                       c(f[i,],klim),
                       c(msr), rule=2)$y, dim(msr))
#
# -- Calculate next step
    arg <- k *sqrt ((df1+df2)*(i*x/(i-1) +(1-x)/tau2))
    ncp <- z *sqrt (i*(1 +(j-1)/tau2))
    u <- dtnc(arg, ncp, df1+df2)
    beta <- exp(con +(df1/2-1)*log(x) +(df2/2-1)*log(1-x))
    v1 <- (beta *u$dens *arg /k) %*% gpt$w
    v0 <- (beta *u$cdf) %*% gpt$w
#
    h[i,] <- (g -v0) /v1
    f[i+1,] <- f[i,] +h[i,]
  }
#
# -- Return arg of k, old k, new k, F(k)
  tw <- list (h, f, v)
  names(tw) <- c('h', 'f', 'g')
  return(tw)
}


welch <- function(msra, p=.9, g=.95, k=5, l=5) {
#
# Welch-Aspin series critical value for tolerance limit problem
#
  xkp <- qnorm(p)
  xkg <- qnorm(g)
  len <- length(msra)
  msr <- array (msra, len)
  n <- k*l
  t1 <- sqrt (1/(1+(l-1)/msr))
  t2 <- sqrt (1/(msr^2 +(l-1)*msr))
  rtk <- sqrt (k)
  rtn <- sqrt (n)
  xl1 <- 1/l^2
  xl2 <- ((l-1)/l)^2
#
  xk <- xkp +t1/rtn *(xkg +1/(4*(k-1)) *(
          xkg *(xkg*xkg +1) +xkp*xkp*xkg *n *t1*t1 *xl1
         +xkp *rtn *t1*t1*t1 *xl1 +xkp*xkg*xkg *rtn *t1 /l)
         +1/(4*k*(l-1)) *(
         +xkp*xkp*xkg *n *t2*t2 *xl2 +xkp *rtn *t2*t2*t2 *xl2))
#
  idx <- sum((1:length(xk))*(xk==min(xk)))
  if (idx > 1) xk[1:idx-1] <- xk[idx]
  if (sum (dim (msra)) != 0) xk <- array (xk, dim(msra))
  return (xk)
}

dtnc <- function (ta, ncpa, df)
{
#
# Mark Vangel, Sept. 1990
#
# Noncentral-t density and cdf
#
  n <- length(ta)
  tnc <- array(0, n)
  fault <- array(0, n)
  t <- array(ta, n)
  ncp <- array(ncpa, n)
#
  z1 <- .Fortran("tnc1", as.double(t * sqrt((df+2)/df)),
                 as.double(df+2),
                 as.double(ncp),
                 fault=as.integer(fault),
                 tnc=as.double(tnc),
                 as.integer(n))
#
  z2 <- .Fortran("tnc1", as.double(t),
                 as.double(df),
                 as.double(ncp),
                 fault=as.integer(fault),
                 tnc=as.double(tnc),
                 as.integer(n))
#
  p <- df/t *(z1$tnc-z2$tnc)
  if (sum(dim(ta)) != 0) n <- dim(ta)
  p <- array(p, n)
  tnc <- array(z2$tnc,n)
  fault1 <- array(z1$fault,n)
  fault2 <- array(z2$fault,n)
  z <- list(tnc, fault)
  z <- list(p, tnc, fault1+fault2)
  names(z) <- c('dens', 'cdf', 'fault')
  return (z)
#
}

Appendix E

S  Code  for  Conditional 
Expectation  Algorithm  for 
Random  Sphere  Problem 


The following function, written in the S programming language (Becker, Chambers and
Wilks, 1988), can be used for solving the integral equation (7.13). The function gauleg
calls a FORTRAN routine to determine Gauss-Legendre quadrature abscissas and weights,
and is documented in Appendix C.

spheres <- function(rhs, npt=25, niter=25, xl=0, xh=1) {
# Mark Vangel Nov 1990
# Stereology problem
  h <- matrix (0, niter, npt)
  f <- h
  v <- h
  gl <- gauleg (npt, xl, xh)
  x <- matrix (gl$x, npt, npt, byrow=T)
  y <- t(x)
  w <- matrix (gl$w, npt, npt)
#
  k <- x *(xh-x)/ sqrt (y**2*(xh-x)**2 +2*x*y *(xh-x))
#
  f[1,] <- 0
  fx <- c(xl, gl$x, xh)
  d <- gl$x *(log(xh +sqrt(xh -gl$x**2)) -log(gl$x))
  tp <- c ((1-x)*y +x)
#
  for (i in 1:(niter-1)) {
    fy <- c(0, f[i,], 0)
    fz <- approx (fx, fy, xout=tp, rule=2)
    v[i,] <- apply (k*w*matrix(fz$y, npt, npt), 2, 'sum')
    h[i,] <- (rhs-v[i,]) /d
    f[i+1,] <- f[i,] +h[i,]
    f[i+1,] <- f[i+1,]*(f[i+1,]>0)
  }
  spheres <- list (h, f, v, rhs, k)
  names(spheres) <- c('h', 'f', 'g', 'u', 'k')
  return (spheres)
}